Protein Science (1999), 8:1358–1361. Cambridge University Press. Printed in the USA. Copyright © 1999 The Protein Society

FOR THE RECORD

Simple sequence is abundant in eukaryotic proteins

G.B. GOLDING
Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada
(Received November 2, 1998; Accepted March 2, 1999)

Abstract: All proteins of Saccharomyces cerevisiae have been compared to determine how frequently segments from one protein are present in other proteins. Proteins that are recently evolutionarily related were excluded. The most frequently present protein segments are long, tandem repetitions of a single amino acid. For some of these segments, up to 14% of all proteins in the genome were found to have similar peptides within them. These peptide segments may not be functional protein domains. Although they are the most common shared feature of yeast proteins, their ubiquity and simplicity argue that their probable function may be to simply serve as spacers between other protein motifs.

Keywords: protein structure; polyamino acids; repeats; yeast

On a large scale most proteins are composed of similar frequencies of the 20 amino acids. Some proteins with unusual structures will differ from these characteristic frequencies, such as particularly hydrophobic proteins or fibrous proteins (Creighton, 1993). With 20 residues possible at each site and most proteins composed of hundreds of amino acids, a random sequence of residues would still be unique. But functional constraints dictate that many proteins must accomplish similar tasks, such as binding to DNA or binding to ATP, and these functions are often accomplished by more or less distinct domains of amino acids. Similar domains in different proteins can be created by the duplication of genes followed by the subsequent fusion of coding sequences (Ohno, 1987). Similar domains in distinct proteins can also be created via either homologous or nonhomologous recombination. Still another common suggestion is that these domains might have originated via exon shuffling (Gilbert, 1978). Indeed it has been suggested that an original advantage of introns was to facilitate the shuffling of exons (Dorit et al., 1990; de Souza et al., 1996; Gilbert et al., 1997). By whatever mechanism these domains were created, their existence has been well established in all organisms from bacteria to man. The number of such domains, their frequencies, and many of their properties are less well known (Doolittle, 1995).

Reprint requests to: G.B. Golding, Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada; e-mail: Golding@McMaster.CA.

Over evolutionary time these domains diverge in sequence and can become difficult to recognize. We wished to determine the most common protein blocks in a eukaryotic genome. But the proteins sequenced from a particular organism are often biased toward genes that have been previously well characterized. A complete genome can be used to avoid any possible bias in the sampling of sequences. The complete sequencing of a eukaryotic genome is a feat recently accomplished for Saccharomyces cerevisiae (Goffeau et al., 1996, 1997; Mewes et al., 1997).

All of the protein sequences from S. cerevisiae were collected from the databases. These totaled 14,914 entries. This number of entries is much larger than the estimated number of proteins in the sequenced genome (Goffeau et al., 1996) due to the redundancy in the databases. All sequences were included since there are large discrepancies between the sequences reported for some proteins. In addition, there are differences among strains that are not included within the single sequenced genome. Some of the proteins in the yeast genome are known to be ancient duplications (Wolfe & Shields, 1997). These proteins will have similar protein segments due to their recent shared ancestry. To eliminate these proteins and to eliminate the redundant duplicates in the databases, all proteins were pairwise aligned. Any entries with more than 20% identity throughout their length were eliminated. In addition, any entries with the keywords “partial” or “fragment” were excluded. A total of 5,459 protein database entries remained. For each of these entries, overlapping segments of no more than 100 amino acids were constructed. For proteins longer than 100 amino acids, segments that overlap by 20 amino acids were constructed. There are a total of 89,826 such overlapping segments. For each of these, all 5,459 yeast proteins were searched for similar peptides using the BLAST algorithm (Altschul et al., 1990).
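The segment construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the program used in the study: the function name `overlapping_segments` and its parameters are hypothetical, and it assumes that "overlap by 20 amino acids" means that consecutive windows share 20 residues (so window starts advance by 80). If the intended reading were a 20-residue offset between window starts, `overlap=80` would apply instead.

```python
def overlapping_segments(seq, length=100, overlap=20):
    """Split a protein sequence into windows of at most `length` residues.

    Assumption (not stated in the text): consecutive windows share
    `overlap` residues, so window starts advance by length - overlap.
    Returns a list of (start_index, segment) pairs.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    if not seq:
        return []
    step = length - overlap
    segments = []
    start = 0
    while True:
        segments.append((start, seq[start:start + length]))
        # Stop once the current window reaches the end of the protein.
        if start + length >= len(seq):
            break
        start += step
    return segments


# Toy example: a made-up 250-residue sequence, not a real yeast protein.
toy_protein = "ACDEFGHIKLMNPQRSTVWY" * 12 + "ACDEFGHIKL"
for start, segment in overlapping_segments(toy_protein):
    print(start, len(segment))
# prints: 0 100, then 80 100, then 160 90
```

A protein of 100 residues or fewer yields a single segment, which matches the statement that segments are "no more than 100 amino acids" long.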
A PAM250 matrix without repeat filtering was used. The number of distinct proteins with a similar segment was recorded. A frequency histogram of the number of distinct proteins with a similar segment is shown in Figure 1. A segment is considered to be similar if it has a significance level of less than 0.05 according to the BLAST algorithm. Figure 1 shows that most protein segments are unique or are present in only a small number of proteins. There were 25 segments that found no similar segment in any protein. This is possible because these segments consist of only a few amino acids (in some cases just six), too few to reach the 0.05 cutoff. Of all segments, 42% were present in only one protein and 90% were present in 10 or fewer other proteins.

Fig. 1. The distribution of similar peptides in the yeast genome.

While most segments are present in only a few proteins, there are some segments present in many other proteins. The most common segment occurs in (and/or has similar peptides in) 754 other proteins. This is 14% of all yeast proteins. This segment is shown in Table 1 under Class a. It contains 51 serine residues, 27 glutamic acid residues, 10 lysine residues, 8 aspartic acid residues, and 4 others. This segment is contained within the SW-NSR1 (YGR159c, ACC P27476) protein, which is identified as a nuclear localization sequence binding protein. An examination of the other segments that are also found in many other proteins shows that most of them are also very rich in serine residues. To eliminate these protein segments and to determine the nature of the remainder of the genome's proteins, a polyserine segment of 100 amino acids was constructed. This was then used to eliminate any protein segment significantly similar to a simple polyserine repeat. Removing all of these entries yields a second class of most common segments. The most frequent peptide of this second class occurs in (and/or has significantly similar peptides in) 428 different proteins. This constitutes 8% of all proteins. The segment contains no serine residues, but has 45 glutamic acid residues, 40 lysine residues, 6 asparagine residues, and 9 others, and has an alternating E4K4 pattern. This segment is contained within the SW-YKU1 (YKL201c, ACC P36043) protein and is identified as a hypothetical protein in the tor2-pas1 intergenic region. Together these segments rich in poly-S and poly-E are present in a total of 16% of all proteins. A similar examination of the other most frequently present segments in this class shows that they, too, are all very rich in glutamic acid residues. A polyglutamic acid segment of 100 amino acids was therefore constructed to eliminate all such segments that show similarity to a simple poly-E repeat. Removing all of these entries yields the third most common class (Table 1 under Class c). This is a segment from SW-YGG6 rich in poly-D with 26 D, 17 N, and 57 other residues. In the same manner, the fourth most common is from SW-ADR6, rich in poly-Q with 30 Q, 23 N, and 47 other residues, and the fifth most common is from PIR-S61046, rich in poly-N with 53 N, 9 I, and 39 other residues.

A total of 21% of all proteins have a segment with significant similarity to either poly-S, poly-E, poly-D, poly-Q, or poly-N. Not all proteins that contain these segments are immediately apparent to the eye because, with a PAM matrix, functionally equivalent amino acids score nearly as high as the identical amino acid. Amazingly, there are 11 proteins that show significant similarity to all of S100, E100, D100, Q100, and N100! As might be expected, some of these are very long proteins but others are comparatively short (ranging from YOL123w, 2934 a.a., and SW-FAB1, 2278 a.a., . . . down to . . . PIR-S50977, 431 a.a., and SW-OPI1, 404 a.a.). There appears to be no common theme that links these 11 proteins. For segments of other lengths, qualitatively similar results are found.
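The successive elimination of poly-S, poly-E, and later classes can be outlined in code. The sketch below is hypothetical in several respects and should be read as an outline of the procedure rather than the method actually used: the study judged similarity to a 100-residue homopolymer with BLAST and a PAM250 matrix, whereas `is_rich_in` here is a crude composition threshold standing in for that test; choosing the probe residue automatically as the dominant residue of the top segment is also an assumption, since the text describes examining the top-ranked segments by eye. All names and the 0.25 threshold are illustrative.

```python
def is_rich_in(segment, residue, min_fraction=0.25):
    """Placeholder for 'significantly similar to a poly-X probe'.

    Assumption: a simple composition threshold replaces the BLAST/PAM250
    significance test used in the paper.
    """
    return bool(segment) and segment.count(residue) / len(segment) >= min_fraction


def most_common_classes(hit_counts, segments, n_classes=5):
    """Repeatedly take the most frequent remaining segment, then drop every
    remaining segment that resembles a homopolymer of its dominant residue.

    hit_counts: dict segment_id -> number of proteins with a similar peptide
    segments:   dict segment_id -> non-empty amino acid sequence
    Returns a list of (segment_id, hit_count, dominant_residue).
    """
    remaining = set(hit_counts)
    classes = []
    while remaining and len(classes) < n_classes:
        # Ties on hit count are broken by segment id for determinism.
        top = max(remaining, key=lambda sid: (hit_counts[sid], sid))
        seq = segments[top]
        dominant = max(sorted(set(seq)), key=seq.count)
        classes.append((top, hit_counts[top], dominant))
        remaining = {sid for sid in remaining
                     if not is_rich_in(segments[sid], dominant)}
        remaining.discard(top)  # guarantee progress even below the threshold
    return classes


# Toy data (invented identifiers, sequences, and counts).
toy_segments = {"s1": "SSSSSSESSS", "s2": "SSSSDSSSSE", "e1": "EEEEKKEEEE",
                "q1": "QQQQQQNQQQ", "u1": "ACDEFGHIKL"}
toy_hits = {"s1": 50, "s2": 40, "e1": 30, "q1": 20, "u1": 1}
print(most_common_classes(toy_hits, toy_segments, n_classes=3))
```

In the toy data the second serine-rich segment is removed together with the first, so the second class reported is the glutamic acid-rich one, mirroring the order of the argument in the text.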
For example, 50-mers rather than 100-mers again show that tandemly repetitive segments are the most frequent, but the number of unique fragments and the relative order of the primary amino acids are changed. The number of relatively unique segments falls, as might be expected for smaller lengths (Fig. 2). Still, the vast majority of segments (90% of 89,698) occur within fewer than 16 other proteins using the 0.05 cutoff value. The maximum number of proteins with similar segments is also drastically reduced. The composition of the most frequent segment is somewhat altered and consists mainly of repeats of asparagine, glutamic acid,

Table 1. The frequency of the most common classes of 100-mers in yeast

Class  Frequency  Protein     Sequence
a      754        SW-NSR1     SSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDSSS
                              SSSDSSSDEEEEEEKEETKKEESKESSSSDSSSSSSSDSESEKEESNDKK
b      428        SW-YKU1     KKEEEEKKKKEEEEKKKKEEEEKKKKEEEEKKKQEEEEKKKKEEEEKKKQ
                              EEGEKMKNEDEENKKNEDEEKKKNEEEEKKKQEEKNKKNEDEEKKKQEEE
c      238        SW-YGG6     CGKPLALTAIVDHLENHCAGASGKSSTDPRDESTRETIRNGVESTGRNNN
                              DDDNSNDNNNDDDDDDDNDDNEDDDDADDDDDNSNGANYKKNDSSFNPLK
d      210        SW-ADR6     NNNNSNNHNMRNNSNNKTSNNNNVTAVPAATPANTNNSTSNANTVFSERA
                              AMFAALQQKQQQRFQALQQQQQQQQNQQQQNQQPQQQQQQQQNPKFLQSQ
e      199        PIR-S61046  NMAPSNSGSPIIIADHFSGNNNIAPNYRYNNNINNNNNNINNMTNNRYNI
                              NNNINGNGNGNGNNSNNNNNHNNNHNNNHHNGSINSNSNTNNNNNNNNGN


Fig. 2. The distribution of similar peptides in the yeast genome for segments of length 50 a.a.

and aspartic acid. The most common 50-mer is present in 451 proteins and is from the SW-YGP0 locus (Table 2). This segment is not as serially repetitive as some of the other common segments in its class but rather contains a mixture of the most commonly present amino acids in these other segments. In this way, it is able to match a greater proportion of proteins and can find a significant match in more than 8% of the genome’s proteins. It contains 16 E’s, 15 N’s, 14 D’s, and 5 other residues. Again, if all peptides that match poly-E are eliminated, the second most common segment consists mainly of N’s and its similarly high-scoring segments are all very rich in poly-N tracts. This segment from SW-MAD1 contains no E’s but has 28 N’s, 10 D’s, and 12 other residues. It occurs in 405 other proteins, approximately 7% of the genome’s proteins. The third most frequent class contains 26 D’s, 9 E’s, and 15 other residues, the fourth contains 30 S’s, 15 T’s, and 5 other residues, and the fifth contains an amazing 47 Q’s and 3 other residues.

A few other eukaryotic proteins have a well-characterized repetitive structure. Usually these are isolated instances, such as the alanine-rich antifreeze proteins of fish (Lin & Gross, 1981), alanine tracts in molluscan shell framework proteins (Sudo et al., 1997), or the polyglutamine repeats in murine GRP-1 (Cox et al., 1996). Usually these repeats involve comparatively small numbers of identical amino acids, such as a “long stretch” of nine serines (Milbrandt, 1987). But the length and high repetition frequency seen here are more unusual. The opa repeats, originally discovered in insects, are comparable to the repeats described here. Opa repeats are simple sequence repeats consisting of poly-Q; for example, 31 tandem residues are present in an opa repeat in the Notch locus. Opa-like repeats have been discovered in Drosophila (Wharton et al., 1985), medflies (Siden-Kiamos et al., 1993), and mice (Duboule et al., 1987; Persengiev & Kilpatrick, 1997). They have been suggested to be characteristic of developmentally regulated genes (Wharton et al., 1985). But there is no direct evidence of a functional role in development other than their presence in such genes. Other developmentally regulated proteins are known in Dictyostelium that have large homopolymer runs (Shaw et al., 1989).

The same five amino acids are found in the common peptide fragments composed of either 50 or 100 residues. There are, however, no unusual characteristics of these particular amino acids. They are neither particularly large nor particularly small amino acids. Nor are they particularly unreactive. Both D and E are acidic residues, S (and T) are hydroxyl residues, while N and Q are amide residues. It is difficult to see how these repeats would form useful secondary or tertiary structures. If these tandem repeats have a distinct function that would explain their high frequency, it is not readily apparent. Many of these tandem repeats have been noticed in individual proteins and often their presence has been noted as rather unusual (O’Hara et al., 1988; Vai et al., 1991; White et al., 1991; Heinonen & Pearlman, 1994; Di Como et al., 1995; Yamamoto et al., 1995; Cox et al., 1996; Sudo et al., 1997). In other proteins containing similar tandem repeats, there can be a sufficient mixture of different but nevertheless functionally similar amino acids to camouflage their existence.
A comparative program that searches for similar peptides and a complete genome are required to demonstrate that these repeats are present in unusually high numbers. The high frequency of these repeats in diverse proteins suggests that they must either have an important, broadly based function or that they are simply dispensable for the protein and happen to be residues that will not disrupt the remainder of the protein. The deletion of a serine-rich region in yeast protein gp115 also suggests that these regions may be dispensable (Gatti et al., 1994). Similarly, the simple nature of their repetitive sequence argues that their probable function may be to simply serve as spacers between other protein motifs, the protein equivalent of junk DNA. A determination of the functions of these repeats must await further study, but in any case, the eukaryotic proteins obtained from the first such organism to be completely sequenced have an underlying repetitive nature whenever possible. This view is a departure from currently held ideas about the structural nature of a eukaryotic protein.
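The homopolymer runs mentioned in the discussion (for example, the stretch of nine serines or the 31 tandem glutamines of an opa repeat) can be measured directly with a run-length scan. The helper below is a hypothetical illustration, not part of the study; it also shows why such a scan alone is insufficient, since a repeat interrupted by functionally similar residues produces only short exact runs, which is the camouflage effect that the similarity search is meant to overcome.

```python
from itertools import groupby


def longest_run(seq):
    """Return (residue, length) for the longest tandem run of one residue.

    Ties keep the first run encountered; an empty sequence gives ("", 0).
    """
    best = ("", 0)
    for residue, group in groupby(seq):
        n = sum(1 for _ in group)
        if n > best[1]:
            best = (residue, n)
    return best


# Toy sequences (invented): an uninterrupted run versus a "camouflaged" one.
print(longest_run("MKASSSSSSSSSLV"))   # ('S', 9)
print(longest_run("MKASTSSTSSTSLV"))   # ('S', 2)
```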

Table 2. The frequency of the most common classes of 50-mers in yeast

Class  Frequency  Protein     Sequence
a      451        SW-YGP0     NGEDEDNDNDNENNNDNDNDNENENDNDSDNDDEEENGEEDEEEEEIEDL
b      405        SW-MAD1     YNDSDDDDDNNVNNNDNNNNNKNDNNNDNNNDTSNNNNINNNNRTKNNIR
c      291        PIR-S64951  SEDNEDDDTDEDSEDDDDDGGDDDDSEDDDDDDDGEGDENGDDGEGDENG
d      220        SW-AGA1     PTTTSLSSTSTSPSSTSTSPSSTSTSSSSTSTSSSSTSTSSSSTSTSPSS
e      210        GP-172638   QQKQQQQQQQHQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQGQ
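The composition figures and percentages quoted in the text can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the denominator of 5,459 is the number of database entries reported as remaining after filtering, which is assumed here to be the base for the quoted percentages.

```python
from collections import Counter


def composition(seq, top=3):
    """Return the `top` most common residues with counts, plus the number
    of residues not covered by those entries."""
    counts = Counter(seq)
    common = counts.most_common(top)
    others = len(seq) - sum(n for _, n in common)
    return common, others


def percent_of_proteins(n_hits, n_proteins=5459):
    """Share of the filtered protein set (assumed base: 5,459 entries)."""
    return 100.0 * n_hits / n_proteins


# Frequencies taken from Tables 1 and 2 (classes a and b of each table).
for hits in (754, 428, 451, 405):
    print(hits, round(percent_of_proteins(hits), 1))
# prints: 754 13.8, 428 7.8, 451 8.3, 405 7.4

# Class e of Table 2, copied from the table; counts are computed, not asserted.
class_e = "QQKQQQQQQQHQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQGQ"
print(composition(class_e, top=1))
```

The rounded values agree with the 14%, 8%, "more than 8%", and "approximately 7%" given in the text.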

Acknowledgments: I thank Dr. H. Bussey for his help and encouragement in the preparation of this manuscript and Dr. R. Pearlman and Dr. B. Coukell for their comments on previous drafts. I thank Dr. A. Edwards for suggesting this line of investigation. This work was supported by a Natural Sciences and Engineering Research Council of Canada grant.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215:403–410.
Cox GW, Taylor LS, Willis JD, Melillo G, White R, Anderson SK, Lin JJ. 1996. Molecular cloning and characterization of a novel mouse macrophage gene that encodes a nuclear protein comprising polyglutamine repeats and interspersing histidines. J Biol Chem 271:25515–25523.
Creighton T. 1993. Proteins: Structures and molecular properties, 2nd ed. New York: W.H. Freeman and Company.
de Souza SJ, Long M, Schoenbach L, Roy SW, Gilbert W. 1996. Intron positions correlate with module boundaries in ancient proteins. Proc Natl Acad Sci USA 93:14632–14636.
Di Como CJ, Bose R, Arndt KT. 1995. Overexpression of SIS2, which contains an extremely acidic region, increases the expression of SWI4, CLN1 and CLN2 in sit4 mutants. Genetics 139:95–107.
Doolittle RF. 1995. The multiplicity of domains in proteins. Annu Rev Biochem 64:287–314.
Dorit RL, Schoenbach L, Gilbert W. 1990. How big is the universe of exons? Science 250:1377–1382.
Duboule D, Haenlin M, Galliot B, Mohier E. 1987. DNA sequences homologous to the Drosophila opa repeat are present in murine mRNAs that are differentially expressed in fetuses and adult tissues. Mol Cell Biol 7:2003–2006.
Gatti E, Popolo L, Vai M, Rota N, Alberghina L. 1994. O-linked oligosaccharides in yeast glycosyl phosphatidylinositol-anchored protein gp115 are clustered in a serine-rich region not essential for its function. J Biol Chem 269:19695–19700.
Gilbert W. 1978. Why genes in pieces? Nature 271:501.
Gilbert W, de Souza SJ, Long M. 1997. Origin of genes. Proc Natl Acad Sci USA 94:7698–7703.
Goffeau A, Aert R, Agostini-Carbone ML, Ahmed A, Aigle M, Alberghina L, Albermann K, Albers M, Aldea M, Alexandraki D, et al. 1997. The yeast genome directory. Nature 387(suppl):1–105.
Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, et al. 1996. Life with 6000 genes. Science 274:546.
Heinonen TY, Pearlman RF. 1994. A germ line-specific sequence element in an intron in Tetrahymena thermophila. J Biol Chem 269:17428–17433.
Lin Y, Gross JK. 1981. Molecular cloning and characterization of winter flounder antifreeze cDNA. Proc Natl Acad Sci USA 78:2825–2829.
Mewes HW, Albermann K, Bahr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, Pfeiffer F, Zollner A. 1997. Overview of the yeast genome. Nature 387:7–65.
Milbrandt J. 1987. A nerve growth factor-induced gene encodes a possible transcriptional regulatory factor. Science 238:797–799.
O’Hara PJ, Horowitz H, Eichinger G, Young ET. 1988. The yeast ADR6 gene encodes homopolymeric amino acid sequences and a potential metal-binding domain. Nucleic Acids Res 16:10153–10169.
Ohno S. 1987. Early genes that were oligomeric repeats generated a number of divergent domains on their own. Proc Natl Acad Sci USA 84:6486–6490.
Persengiev SP, Kilpatrick DL. 1997. Characterization of a cDNA containing trinucleotide repeat sequences that is highly enriched in spermatogenic cells. Mol Repr Dev 46:476–481.
Shaw DR, Richter H, Giorda R, Ohmachi T, Ennis HL. 1989. Nucleotide sequences of Dictyostelium discoideum developmentally regulated cDNAs rich in (AAC) imply proteins that contain clusters of asparagine, glutamine, or threonine. Mol Gen Genet 218:453–459.
Siden-Kiamos I, Favia G, Artiaco D, Saccone G, Furia M, Polito LC, Louis C. 1993. Opa-like repeats in the genome of the Medfly Ceratitis capitata. Genetica 92:43–53.
Sudo S, Fujikawa T, Nagakura T, Ohkubo T, Sakaguchi K, Tanaka M, Nakashima K, Takahashi T. 1997. Structures of mollusc shell framework proteins. Nature 387:563–564.
Vai M, Gatti E, Lacana E, Popolo L, Alberghina L. 1991. Isolation and deduced amino acid sequence of the gene encoding gp115, a yeast glycophospholipid-anchored protein containing a serine-rich region. J Biol Chem 266:12242–12248.
Wharton KA, Yedvobnick B, Finnerty VG, Artavanis-Tsakonas S. 1985. Opa: A novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. melanogaster. Cell 40:55–62.
White MJ, Hirsch JP, Henry SA. 1991. The OPI1 gene of Saccharomyces cerevisiae, a negative regulator of phospholipid biosynthesis, encodes a protein containing polyglutamine tracts and a leucine zipper. J Biol Chem 266:863–872.
Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.
Yamamoto A, DeWald DB, Boronenkov IV, Anderson RA, Emr SD, Koshland D. 1995. Novel PI(4)P 5-kinase homologue, Fab1p, essential for normal vacuole function and morphology in yeast. Mol Biol Cell 6:525–539.