Mathematical Biosciences 182 (2003) 167–181 www.elsevier.com/locate/mbs

A new family of global protein shape descriptors

Peter Røgen a,*,1, Henrik Bohr b,2

a Department of Mathematics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark
b QUT, The Quantum Protein Center, Department of Physics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark

Received 28 June 2002; received in revised form 7 October 2002; accepted 7 November 2002

Abstract

A family of global geometric measures is constructed for protein structure classification. These measures originate from integral formulas of Vassiliev knot invariants and give rise to a unique classification scheme. Our measures can better discriminate between many known protein structures than the simple measures of the secondary structure content of these protein structures. © 2003 Elsevier Science Inc. All rights reserved.

Keywords: Protein structure classification; Writhe; Average crossing number; Gauss integrals; Crossing configuration

1. Introduction

In this paper we address the issue of finding global measures which can geometrically characterize and therefore classify the 3-dimensional structures of proteins. One reason for searching for global geometric measures is that it is proven in [1] that, if protein backbones are represented as framed space curves, then any local geometric description, such as curvature and torsion, is discontinuous and thus unsuited for describing protein structures. The most obvious classification of proteins is derived from the primary sequences of their amino acids and, although not fully automated and uniquely defined, this is performed by utilizing various alignment schemes. As long as one can successfully carry out a multiple alignment

* Corresponding author. Tel.: +45-4525 3044/3031; fax: +45-4588 1399. E-mail addresses: [email protected] (P. Røgen), [email protected] (H. Bohr).
1 The work was supported by a DTU-grant and by a grant from Carlsbergfondet.
2 The work was supported by Grundforskningsfonden.
0025-5564/03/$ - see front matter © 2003 Elsevier Science Inc. All rights reserved. doi:10.1016/S0025-5564(02)00216-X


of the various primary sequences, one can define a similarity measure by counting the number of identical residue matches that arise from the optimal alignment. However, such a classification does not necessarily have anything to do with the similarity of the proteins' 3-dimensional structures. Thus one can define protein families and superfamilies based on the proteins' sequence identity by matching pairs in the alignments.

The concept that concerns protein 3-dimensional structural homology is often termed protein folds and protein fold classification. It is done partly by visual inspection using computer graphics [2,3], visualizing the 3-dimensional protein structure from the Cartesian coordinates given in the crystallographic databases [4]. For NMR structures one also has to deal with the multiple backbone structures that arise from that methodology. An important element in the identification of protein structures from computer graphics is the periodic secondary structure elements, the helices, sheets, and turns, which appear as 'landmarks' that are often highlighted by ribbon representations. The basic problem with these graphical representations is the difficulty of going from a structure in 3 dimensions to various 2-dimensional projections seen on the computer screen. It is crucial to choose the right projections in order to get the right understanding of the 'topology' of the secondary structures, i.e., to determine how the various secondary structures are connected.

The problem with all the possible 2-dimensional projections made us think of a rigorous mathematical way of measuring the 3-dimensional structure that models observation of planar projections. This led us to use geometric measures, the writhe and generalizations thereof, known from knot theory. Examples of such work are found in Ref. [5] and the references within that paper, in which the possibility of seeing n crossings of a backbone from an arbitrarily chosen direction in space is considered. Here we will not consider crossing number frequency distributions. Instead, the measures we introduce count the number of times any given configuration of crossings is seen in the case of an almost planar curve. The two simplest of the structural measures we consider are called the writhe and the average crossing number and have previously been applied to analyse protein structures. Levitt uses the writhe to distinguish different chain threadings in Ref. [6]. Arteca and Tapia use the average crossing number and the most probable overcrossing number as protein shape descriptors in Ref. [7].

We shall first discuss the various phenomenological classification schemes for protein structures and then turn to the knot-theoretically inspired treatments of the protein backbones.

2. A phenomenological look at protein folds

The individual 3-dimensional protein structures determined by NMR and X-ray crystallography can be grouped into a smaller number of characteristic structural classes. These structural classes consist of domains from homologous proteins with similar 'chain topological' configurations of their backbone atoms [8]. Classification of protein structures has given rise to several useful databases, such as SCOP [3] and CATH [2]. These structural domains, or the so-called folds of the proteins, were introduced in order to clarify the notion of structural similarity. Such fold classes could contain entire proteins or well-defined sub-domains of proteins, depending on the length and the complexity of the given protein.


Pascarella and Argos [8] have used chain topological similarity as a measure of fold class homology, while Holm and Sander [9] have used similarity of distance matrices to determine fold class membership. Orengo et al. [10] have reported a classification of all proteins in the protein structural database into more than 100 folds from structural comparison. Common to these studies, except perhaps for that of Holm and Sander, is that they define fold classes in a rather ad hoc manner. They do not provide a unique way of determining what fold class an entirely new protein belongs to. In the case of Holm and Sander there is a procedure for determining fold class provided the proteins have well-structured distance matrices (i.e., with clearly separated subdomains).

While it is feasible to determine membership of a fold class once the 3-dimensional structure of the protein is constructed, earlier efforts to predict fold classes only from sequences have achieved little success [11]. In the elaborate methods of Jones and co-workers [12] the fold class membership for a protein is determined by fitting primary as well as tertiary structures. Similarly, the methods of Sippl [13] and Gribskov et al. [14] make use of primary structure fitting through multiple alignment, potentials and profiles. Successful fold class predictions from sequences are those cases where there is significant sequence homology between the protein whose structure is to be determined and the one whose structure is established. Most frequently, sequences with large mutual homology are known to belong to the same fold class. There are, however, cases where the members of a particular fold class have little mutual sequence homology.

For example, the proteins adenosine deaminase (1add), aldolase A (1ald), aldose reductase (1ads), the first domain of cyclodextrin glycosyltransferase (1cdg), beta-amylase (1btc), endo-1,4-beta-D-glucanase (1tml), the second domain of chloromuconate cycloisomerase (1chr.A), etc. [14], all belong to the 'barrel' class and the sequence homology between any pair of these is insignificant. Their mutual structural homology is, of course, large. In most definitions of fold classes, each member would have more than 50% sequence identity to each other, although domains with far less sequence similarity could belong to the same class. It is important that each protein within a class has a structure with a large topological similarity and a similar packing pattern to other members of the class. The details of the primary sequence in itself are less important. The notion of fold classes is important for predicting new protein structures using homology modeling. In homology modeling an unknown 3-dimensional protein structure is inferred from other known 3-dimensional protein structures whose amino acid sequences are similar to the sequence of the protein in question.

One can make a crude classification of protein domains into what we call super fold classes by simply distinguishing them by their content of secondary structures. Such a super classification might actually also turn out to be deeply connected to the folding process and could also give rise to a measure of distance among the fold classes, in the sense that folds most different in secondary structure content are furthest apart. We thus define four superclasses: (1) the class of pure alpha helices (denoted α), (2) the class with only beta sheets (denoted β), (3) the class with alpha helices and beta sheets clearly separated (written α + β) and finally, (4) the class of folds having alpha helices and beta sheets entangled (denoted α/β).
These four classes are very well illustrated by the four prototypical proteins shown in Figs. 1 and 2.


Fig. 1. Typical members from the super fold classes α (256B:A, left) and β (1ACX, right).

Fig. 2. Typical members from the super fold classes α + β (1PAZ, left) and α/β (3TIM:A, right).

3. Motivation for the measures

The methodology we propose for classifying space curves representing proteins is, as mentioned above, concerned with crossings seen in planar projections of the curves. In Knot Theory such planar projections are called knot diagrams and serve as a fundamental tool for dealing with knots. Here Knot Theory can only be a matter of inspiration since knots are defined on closed curves while we treat open curves for proteins. Nevertheless, it turned out to be useful to employ a family of integrals for open curves that are inspired by generalized Gauss integrals involved in formulas for the so-called Vassiliev knot invariants.

Let us therefore introduce a few concepts from Knot Theory. A knot is a closed curve without self-intersections in 3-space. A knot is said to be oriented if equipped with a preferred direction of traversal. A plane projection of a knot is called a shadow of the knot. Indicating over and under crossings on a shadow of a knot, as done in Fig. 3, a knot diagram of the knot is obtained, provided that the shadow has only transversal double points. That is, on the shadow triple points are not allowed and at each double point the two tangents of the curve are not parallel. The crossings of an oriented knot diagram may be assigned a sign


Fig. 3. Left: a knot diagram of the right-handed trefoil knot. Right: the signs of crossings according to the usual right-hand rule.

using the usual right-hand rule. This is also shown in Fig. 3. As the signs of the crossings are independent of the orientation of the knot, this gives signs of crossings in unoriented knot diagrams.

The first structural measure to be considered here is the writhe of a space curve, known from the famous Călugăreanu–Pohl–White self-linking formula. The writhe, Wr, of a closed space curve, $\gamma$, may be calculated using the Gauss integral

\[ \mathrm{Wr}(\gamma) = \frac{1}{4\pi} \iint_{\gamma\times\gamma\setminus D} \omega(t_1,t_2)\,dt_1\,dt_2, \]

where

\[ \omega(t_1,t_2) = \frac{[\gamma'(t_1),\ \gamma(t_1)-\gamma(t_2),\ \gamma'(t_2)]}{|\gamma(t_1)-\gamma(t_2)|^{3}} \]

and $D$ is the diagonal of $\gamma\times\gamma$. Notice that the triple scalar product $[\gamma'(t_1),\ \gamma(t_1)-\gamma(t_2),\ \gamma'(t_2)]$ equals the oriented volume of the parallelepiped spanned by $\gamma'(t_1)$, $\gamma(t_1)-\gamma(t_2)$, and $\gamma'(t_2)$. Hereby $\omega(t_1,t_2)=\omega(t_2,t_1)$. As we may assume that $\gamma$ is parametrized by $[0,1]$, it is enough to calculate the above integral on the 2-simplex $\Delta^2=\{(t_1,t_2);\ 0<t_1<t_2<1\}$. Setting $I_{(1,2)}=\int_{\Delta^2}\omega(t_1,t_2)\,dt_1\,dt_2$, we have $\mathrm{Wr}=(1/2\pi)\,I_{(1,2)}$.

Let $e$ be a unit vector in 3-space and let $p_e(\gamma)$ be the projection of the closed curve $\gamma$ along $e$. For almost all $e$, the planar curve $p_e(\gamma)$ gives a knot diagram of $\gamma$. The writhe of a knot diagram is the sum of the signs of the crossings in the knot diagram. A well-known geometric interpretation of the writhe of a closed space curve is that $4\pi\,\mathrm{Wr}$ equals the integral over all projections of the writhe of the knot diagrams. That is, Wr is the average number of signed crossings seen when looking at the knot from all directions in 3-space. To explain this, let $e(t_1,t_2)$ be the unit vector from $\gamma(t_1)$ to $\gamma(t_2)$. Then $\omega(t_1,t_2)$ is the pull back of the area 2-form on the unit 2-sphere. Hence, the integral $\iint_{\gamma\times\gamma\setminus D}\omega(s_1,s_2)\,ds_1\,ds_2$ measures the signed area on the unit 2-sphere swept out by $e(t_1,t_2)$. For any $t_1$ and $t_2\neq t_1$ the projection along $e(t_1,t_2)$ shows a crossing corresponding to the two points $\gamma(t_1)$ and $\gamma(t_2)$. This crossing has the same sign as $\omega(s_1,s_2)$.

Taking the absolute value of the integrand in the writhe integral one gets the average crossing number, that is, the average number of crossings (without signs) seen when looking at the knot from all directions in 3-space. By analogy to the above, we introduce the integral $I_{|1,2|}=\int_{\Delta^2}|\omega(t_1,t_2)|\,dt_1\,dt_2$. Note that both the writhe and the average crossing number, by their pointwise counting of crossings, have the same geometric interpretations if applied to open curves.
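As a rough illustration of the two integrals just introduced, the following Python sketch estimates $I_{(1,2)}$ and $I_{|1,2|}$ for an open polygonal curve by a plain midpoint rule. It is not the evaluation method of this paper: the function names, the test helix and the quadrature are assumptions made for illustration only, and the sign of the result simply follows the triple product in the order written above.

```python
import math

def triple(a, b, c):
    """Triple scalar product [a, b, c] = a . (b x c)."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def gauss_integrals(points):
    """Midpoint-rule estimates of (I_(1,2), I_|1,2|) for an open polygon."""
    pairs = list(zip(points, points[1:]))
    mids = [[(x + y) / 2 for x, y in zip(p, q)] for p, q in pairs]
    steps = [[y - x for x, y in zip(p, q)] for p, q in pairs]  # gamma'(t) dt
    signed = unsigned = 0.0
    for i in range(len(mids)):
        for j in range(i + 1, len(mids)):      # 2-simplex: t1 < t2
            d = [x - y for x, y in zip(mids[i], mids[j])]  # gamma(t1) - gamma(t2)
            r = math.sqrt(sum(x * x for x in d))
            w = triple(steps[i], d, steps[j]) / r ** 3
            signed += w
            unsigned += abs(w)
    return signed, unsigned

def helix(turns=3, n=200, handed=1.0):
    """Hypothetical test curve; handed=-1.0 gives its mirror image."""
    return [[math.cos(t), handed * math.sin(t), 0.3 * t]
            for t in (2 * math.pi * turns * k / n for k in range(n + 1))]

i12, i_abs = gauss_integrals(helix())
print("I_(1,2) ~", i12, " I_|1,2| ~", i_abs, " Wr ~", i12 / (2 * math.pi))
```

The estimate vanishes for a planar curve, changes sign under mirroring, and is unchanged by rescaling, which are the qualitative properties one expects of these measures.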
Consider the integral $I_{(1,3)(2,4)}=\int_{\Delta^4}\omega(t_1,t_3)\,\omega(t_2,t_4)\,dt_1\,dt_2\,dt_3\,dt_4$, where $\Delta^4$ is the 4-simplex given by $0<t_1<t_2<t_3<t_4<1$. This integral is involved in an integral formulation of a Vassiliev knot


invariant of second order and comes from the perturbative expansion of Witten's Chern–Simons path integral associated with a knot in 3-space. The geometric interpretation of $I_{(1,3)(2,4)}$ given in [15], in the case of an almost planar space curve, can be formulated as follows: consider a knot diagram with two crossings having the property that the parameter values $0<t_1<t_2<t_3<t_4<1$ corresponding to these two crossings fulfill that $t_1$ and $t_3$ correspond to one of the two crossings and $t_2$ and $t_4$ correspond to the other. Denote such a crossing configuration by $(1,3)(2,4)$. This crossing configuration is counted as occurring once if both crossings are positive or if both are negative. It is counted as one negative occurrence if the two crossings have opposite signs. In the limit where the knot becomes planar the integral $I_{(1,3)(2,4)}$ converges to a constant times the number of crossings (without signs) plus a constant times the number of occurrences of the crossing configuration $(1,3)(2,4)$, counted with sign.

A heuristic geometric-statistical explanation of why $I_{(1,3)(2,4)}$ is strongly connected to the frequency of $(1,3)(2,4)$ crossing configurations and to the frequency of crossings is as follows. Fix $t_1$ and $t_3$. Hereby $e(t_1,t_3)$ is fixed. When $t_2$ and $t_4$ vary, the vector $e(t_2,t_4)$ sweeps out an area on the unit 2-sphere. If this area is, e.g., 7.3 times the area of the unit 2-sphere, then the equation $e(t_2,t_4)=e(t_1,t_3)$ is expected to be fulfilled 7.3 times on average if $e(t_2,t_4)$ and $e(t_1,t_3)$ are uncorrelated. However, in the limit $t_2\to t_1$ and $t_4\to t_3$ the vectors $e(t_2,t_4)$ and $e(t_1,t_3)$ become equal. Hence, one would expect the equation $e(t_2,t_4)=e(t_1,t_3)$ to be fulfilled more often than the uncorrelated expectation, which is in perfect agreement with the limit of the integral found in [15] for almost planar space curves.
Note that the above geometric-statistical argument does not depend on the fact that the curve is closed.

For a configuration of $n$ crossings given by an ordered pairing of the integers $1,\ldots,2n$ of the form $(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})$ with $i_1<i_2$, $i_3<i_4$, $\ldots$, $i_{2n-1}<i_{2n}$, and $i_1<i_3<\cdots<i_{2n-1}$ we define the integral³

\[ I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})} = \int_{\Delta^{2n}} \omega(t_{i_1},t_{i_2})\,\omega(t_{i_3},t_{i_4})\cdots\omega(t_{i_{2n-1}},t_{i_{2n}})\,dt_1\cdots dt_{2n}, \]

where $\Delta^{2n}$ is the $2n$-simplex given by $0<t_1<t_2<\cdots<t_{2n}<1$. We refer to the number $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ as a geometric measure of order $n$. Note that all of these measures are independent of translation, rotation, and scale. Furthermore, they all depend continuously on deformation of the curve.

An argument analogous to the above geometric-statistical argument concerning the integral $I_{(1,3)(2,4)}$ applies also to the general integrals $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$. That is, we expect the geometric measure of order $n$ denoted by $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ to be correlated with the average occurrence of the crossing configuration $(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})$, see Fig. 4, plus constants times the geometric measures of lower order. The main goal of this paper is to introduce the above integrals, or rather discrete versions applying to polygonal curves, as measures of protein structure. We will not elaborate on the validity of the above heuristic geometric-statistical argument.
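To make the indexing concrete, the sketch below enumerates the crossing configurations of order $n$ and evaluates a discrete analogue of the corresponding measure from a table of pairwise values. Everything here is an illustrative assumption rather than the paper's implementation: the table `omega` is a hypothetical precomputed input with one entry per pair of segments, and the sum runs over strictly increasing index tuples, mirroring the simplex $0<t_1<\cdots<t_{2n}<1$.

```python
from itertools import combinations

def crossing_configurations(n):
    """Yield all pairings (i1,i2)(i3,i4)...(i_{2n-1},i_{2n}) of 1..2n with
    i_{2k-1} < i_{2k} in each pair and i1 < i3 < ... < i_{2n-1}."""
    def rec(remaining):
        if not remaining:
            yield ()
            return
        first, rest = remaining[0], remaining[1:]   # smallest unused index
        for k, partner in enumerate(rest):
            for tail in rec(rest[:k] + rest[k + 1:]):
                yield ((first, partner),) + tail
    yield from rec(tuple(range(1, 2 * n + 1)))

def discrete_measure(omega, config):
    """Sum over j_1 < ... < j_2n of the product of omega[j_a][j_b] over the
    pairs (a, b) of `config` (assumed discrete analogue of the integral)."""
    total = 0.0
    for js in combinations(range(len(omega)), 2 * len(config)):
        term = 1.0
        for a, b in config:
            term *= omega[js[a - 1]][js[b - 1]]     # config indices are 1-based
        total += term
    return total

# Hypothetical 4-segment table, used only to exercise the indexing.
omega = [[j - i for j in range(4)] for i in range(4)]
for config in crossing_configurations(2):
    print(config, discrete_measure(omega, config))
```

There are $(2n-1)!!$ configurations of order $n$ (1, 3, 15, ...), and the brute-force sum visits $\binom{L}{2n}$ index tuples for $L$ segments, so this direct form is only practical for low orders and short chains.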

³ It is more natural to define the integrals times $(2/4\pi)^n$, where $4\pi$ normalizes the area of the unit 2-sphere and the factor of two takes into account that a crossing seen in one direction is also seen from the antipodal direction. As we are going to rescale the values of the integrals later, we have avoided this multiplicative constant.


Fig. 4. Left: a knot diagram of a helix. Right: a corresponding crossing configuration.

4. Evaluation on polygonal curves

In the following we shall derive reductions of the generalized Gauss integrals introduced. We have decided to represent proteins by the polygonal curve connecting the $C_\alpha$-atoms. This representation is not faithful, as the positions of three $C_\alpha$-atoms in sequence can in general be realized by two different pairs of dihedral angles associated to the middle $C_\alpha$-atom (see [1]). From a structural point of view, this local ambiguity is unimportant, as the large scale shape of the protein backbone, which is the main issue of consideration here, is unchanged.

We start by finding an expression for $I_{(1,2)}$ evaluated on two line segments. We found numerical integration too slow, but have used it to test the explicit formula given below. Let $l_1$ be the line segment from $r_{11}$ to $r_{12}$ and let $l_2$ be the line segment from $r_{21}$ to $r_{22}$. We need to find the signed area on the unit 2-sphere of the domain, $D$, swept out by the unit vectors, $e$, starting at points on $l_1$ with directions to points on $l_2$.

First consider the boundary, $\partial D$, of the domain $D$. This boundary consists of four great circle segments. The reason for this is as follows: consider the unit vectors, denoted by $e(r_{11},l_2)$, that start at $r_{11}$ with directions to the points of $l_2$. If the point $r_{11}$ does not lie on the line defined by $l_2$, then $r_{11}$ and $l_2$ define a plane. Hereby, $e(r_{11},l_2)$ lies on the intersection between a plane and the unit 2-sphere. The set $e(r_{11},l_2)$ is thus a part of a great circle on the unit 2-sphere. The line $l_1$ lies on one side of the plane through $r_{11}$ and $l_2$. Hence, the domain $D$ is contained in one of the two hemispheres that the above plane divides the unit 2-sphere into. We thus conclude that the domain $D$ is convex and is contained in a hemisphere.

The Gauss–Bonnet theorem asserts that if $R$ is a simply connected region in a surface bounded by a piecewise differentiable curve $\alpha$, with pieces $\alpha_1,\alpha_2,\ldots,\alpha_n$, making exterior angles $\theta_1,\theta_2,\ldots,\theta_n$ at the vertices of $\alpha$, then

\[ \sum_{j=1}^{n}\int_{\alpha_j} k_g(s)\,ds + \iint_R K\,dA = 2\pi - \sum_{j=1}^{n}\theta_j, \]

where $k_g$ is the geodesic curvature and $K$ is the Gauss curvature. The unit 2-sphere has Gauss curvature one everywhere and great circles have zero geodesic curvature everywhere. Therefore the Gauss–Bonnet theorem becomes

\[ A = \iint_R 1\,dA = 2\pi - \sum_{j=1}^{4}\theta_j. \]

By convexity, the domain $D$ is always a simply connected region. If the boundary of $D$ is traversed in the positive direction, then the enclosed area is directly given by the Gauss–Bonnet theorem as $A = 2\pi - \sum_{j=1}^{4}\theta_j$. If the boundary of $D$ is traversed in the negative direction, the area of the complement to $D$ is $A_c = 2\pi - \sum_{j=1}^{4}\theta_j$, by the Gauss–Bonnet theorem. The area of the unit


2-sphere is $4\pi$. Hence the positive area of $D$ is $4\pi - A_c = 2\pi + \sum_{j=1}^{4}\theta_j$. However, $D$ is negatively traversed, so the signed area of $D$ is $A = -2\pi - \sum_{j=1}^{4}\theta_j$. The two expressions for the signed area of $D$ are finally collected in the formula

\[ A = \operatorname{sign}\Bigl(\sum_{j=1}^{4}\theta_j\Bigr)\,2\pi - \sum_{j=1}^{4}\theta_j. \]

To obtain the desired expression for $I_{(1,2)}$ we need to find the exterior angles $\theta_1,\ldots,\theta_4$ given two line segments $l_1$ and $l_2$. Let $a$, $b$, and $c$ be three unit vectors. As we may assume that $a$, $b$, and $c$ lie on the same open hemisphere, the great circle pieces connecting $a$ to $b$ to $c$ are well-defined. Denote the exterior angle at $b$ by $\angle(a,b,c)$. Standard vector calculus gives that

\[ \cos(\angle(a,b,c)) = \frac{a\times b}{|a\times b|}\cdot\frac{b\times c}{|b\times c|} = \frac{(a\cdot b)(b\cdot c)-(a\cdot c)(b\cdot b)}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}} = \frac{(a\cdot b)(b\cdot c)-(a\cdot c)}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}}, \]

\[ \sin(\angle(a,b,c)) = \frac{a\times b}{|a\times b|}\cdot\frac{b\times(b\times c)}{|b\times c|} = \frac{a\times b}{\sqrt{1-(a\cdot b)^2}}\cdot\frac{(b\cdot c)\,b-(b\cdot b)\,c}{\sqrt{1-(b\cdot c)^2}} = \frac{-(a\times b)\cdot c}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}}. \]
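The construction above can be sketched in Python for a single pair of segments. This is a minimal sketch under stated assumptions, not the authors' code: the helper names are hypothetical, the four corner directions are visited in the order that follows the positive boundary of the parameter square, the sine is taken as $-(a\times b)\cdot c$ so that the result agrees in sign with the integrand $\omega$ as written in Section 3, and the two segments are assumed not to be coplanar. Since the cosine and sine share the same positive denominator, `atan2` only needs the two numerators.

```python
import math

def sub(u, v):
    return [x - y for x, y in zip(u, v)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]

def exterior_angle(a, b, c):
    """Signed exterior angle at b on the great-circle path a -> b -> c."""
    cos_num = dot(a, b) * dot(b, c) - dot(a, c)
    sin_num = -dot(cross(a, b), c)          # sign convention assumed, see lead-in
    return math.atan2(sin_num, cos_num)

def omega_pair(r11, r12, r21, r22):
    """Signed area swept by unit vectors from segment r11-r12 to r21-r22."""
    corners = [unit(sub(r21, r11)), unit(sub(r21, r12)),
               unit(sub(r22, r12)), unit(sub(r22, r11))]
    s = sum(exterior_angle(corners[k - 1], corners[k], corners[(k + 1) % 4])
            for k in range(4))
    if abs(s) < 1e-12:                      # degenerate sum; treated as zero area
        return 0.0
    return math.copysign(2 * math.pi, s) - s   # A = sign(S) 2*pi - S

# Two perpendicular segments one unit apart: the swept region is one face of
# a cube projected onto the sphere, so the area has magnitude 4*pi/6.
print(omega_pair([-1, 0, 1], [1, 0, 1], [0, -1, 0], [0, 1, 0]))
```

For this test configuration the integrand of Section 3 reduces to $-1/(1+s^2+t^2)^{3/2}$ on $[-1,1]^2$, whose integral is $-2\pi/3$, so the closed form can be checked against a direct numerical integration.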

Let $p_1,\ldots,p_L$ denote the vertices of a polygonal line, let $l_j$ denote the line segment from $p_j$ to $p_{j+1}$, and let $\omega(i,j)$ denote the value of $I_{(1,2)}$ when restricted to the line segments $l_i$ and $l_j$. The unit vector with direction from $p_i$ to $p_j$ is denoted $e(i,j)$. The sum of the exterior angles is then, following the positive orientation $(i,j)\mapsto(i+1,j)\mapsto(i+1,j+1)\mapsto(i,j+1)\mapsto(i,j)\mapsto\cdots$ of the boundary of the planar square with the same points as corners, given by

\[ S(i,j) = \sum_{k=1}^{4}\theta_k = \angle\bigl(e(i,j),e(i+1,j),e(i+1,j+1)\bigr) + \angle\bigl(e(i+1,j),e(i+1,j+1),e(i,j+1)\bigr) + \angle\bigl(e(i+1,j+1),e(i,j+1),e(i,j)\bigr) + \angle\bigl(e(i,j+1),e(i,j),e(i+1,j)\bigr). \]

And we finally find $\omega(i,j) = \operatorname{sign}(S(i,j))\,2\pi - S(i,j)$ for $0<i<j-1<L$. If $j=i+1$ then the two line segments $l_i$ and $l_{i+1}$ lie in a plane and $\omega(i,i+1)=0$. The $2n$-double integral $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ is now replaced by the $2n$-double sum of products of the $\omega(i,j)$s

\[ I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})} = \sum_{0\,\ldots} \omega(j_{i_1},j_{i_2})\,\omega(j_{i_3},j_{i_4})\cdots\omega(j_{i_{2n-1}},j_{i_{2n}}). \]