Anonymous Arabidopsis cDNA Clones - NCBI - NIH

Jul 13, 1994 - GenBank release 70 by using the BLAST e-mail server provided by NCBI. The six possible deduced amino acid sequences of the ESTs were ...
Plant Physiol. (1994) 106: 1241-1255

Genes Galore: A Summary of Methods for Accessing Results from Large-Scale Partial Sequencing of Anonymous Arabidopsis cDNA Clones

Tom Newman, Frans J. de Bruijn, Pam Green, Ken Keegstra, Hans Kende, Lee McIntosh, John Ohlrogge, Natasha Raikhel, Shauna Somerville, Mike Thomashow, Ernie Retzel, and Chris Somerville*

Arabidopsis Expressed Sequence Tag Project, Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, Michigan 48824 (T.N., F.J.d.B., P.G., K.K., H.K., L.M., J.O., N.R., M.T.); Computational Biology Center, Medical School, University of Minnesota, 1460 Mayo, UMHC 196, 420 Delaware Street S.E., Minneapolis, Minnesota 55455-0312 (E.R.); and Carnegie Institution of Washington, Department of Plant Biology, 290 Panama Street, Stanford, California 94305-4101 (S.S., C.S.)

of 397 bp of sequence was obtained from one end of each of 2375 randomly selected clones from several commercially available cDNA libraries of human brain (Adams et al., 1992). In spite of the fact that no effort was made to eliminate redundant sequences, no gene was sequenced more than 16 times (actin), and the total number of 3-fold or greater redundancies was 142 (i.e. 80 were considered to indicate potentially significant homology.

Data and Plasmid Storage

All EST sequences have been deposited in dbEST, a public-access data base designed specifically for ESTs (Boguski et al., 1993). Information on how to retrieve a sequence can be obtained from [email protected] by placing the word "help" in the body of the message and leaving the subject line blank. An efficient way to determine whether an EST for a particular protein of interest is available is to run a TBLASTN search against dbEST using the amino acid sequence of the known protein as the query. Information on how to format such a query can be obtained from [email protected] by placing the word "help" in the body of the message. The plasmids corresponding to the ESTs have been deposited with the Arabidopsis Biological Resource Center at The Ohio State University, 1735 Neil Avenue, Columbus, OH 43210. DNA may be ordered by mail, by fax (1-614-292-0603), on line through the AIMS data base (for help contact [email protected]), or by e-mail from dna@genesys.cps.msu.edu. Use of the AIMS data base is recommended because it contains a record of previous requests for each EST clone.

EST Identifiers

The clone names available in the dbEST report reflect the position of the clone in a 96-well plate (e.g. 49C11T7 is from plate 49, row C, column 11). As noted above, the last two or three letters indicate the primer that was used to produce the sequence.

RESULTS

cDNA Libraries

During the initial stages of the experiments described here, cDNA clones were picked randomly from several available libraries. During the latter stages of the project, clones were from an oriented λZipLox library, designated PRL2, which was constructed from equal amounts of mRNA from etiolated seedlings, roots, leaves, and shoots of all maturity stages. The primary library contained 1.2 × 10^6 recombinant phage and, therefore, was considered to have adequate representation of the expressed genes. The quality of the library (with respect to insert size) was assessed by comparing the partial nucleotide sequences obtained for abundant isoforms of catalase and several Chl a/b binding proteins (Fig. 1). Of 12 sequences obtained for catalase, 7 contained the translation initiation codon. Similarly, of 15 Chl a/b binding clones sequenced, the translation initiation ATG was present in 12 clones. Thus, it appears that for mRNA species in the range of 1 to 1.7 kb, a majority of the cDNA clones contain the translational start codon. Since homology between plant and nonplant gene products is frequently not found at the amino-terminal region of the proteins, the assignment of probable function to ESTs by data base analysis is facilitated by the presence in the library of a certain proportion of less than full-length cDNAs so that internal sequences can also be obtained. However, since it eventually may be desirable to obtain the complete sequence of all the ESTs, the use of a library with a high proportion of full-length clones was considered preferable in the long term.

Sequence Analysis

A total of 1518 single-pass nucleotide sequences were obtained from 1477 randomly picked cDNA clones. Forty-one of these clones were sequenced from both ends. For most of the sequences from the oriented libraries (PRL1 and PRL2), the sequences were obtained only from the putative 5' end of the cDNA to enhance the probability of obtaining coding sequence. Each sequence was manually processed to remove vector sequences from the 5' end, to resolve (as far as possible) sequencing ambiguities that were not assigned by the automated sequenator, and to decide where to terminate the sequence. The average EST produced in this way contains approximately 375 bp of sequence. Comparison of 31 ESTs with previously published sequences indicated that the error rate was approximately 0.3% (29/10,500) for the first 300 bp and about 4% for >300 bp. Several of the ESTs obtained during the early stages of the project and deposited in dbEST have subsequently been found to contain vector sequences, which resulted from the sequencing run being longer than the insert in the cDNA clone or for other reasons. For this and related reasons, it is advisable to analyze any EST sequence for homology to the vector and to the current release of the data base before proceeding to use the sequence or the clone for any experimental purposes. Each of the edited sequences was deposited in dbEST and compared to the nonredundant nucleotide and protein sequence data bases by BLASTN (nucleotide homology) and BLASTX (deduced amino acid sequence homology) searches (Altschul et al., 1990). Deduced amino acid sequence homology between an EST and a known sequence was deemed significant if the BLASTX PAM120 score was greater than 80. From this analysis 292 of the ESTs had significant homology to 88 previously identified cDNA clones from Arabidopsis (Table I). In most cases the previously identified clones were ESTs reported by Hofte et al. (1993).
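The error-rate figures quoted above (29 mismatches in 10,500 compared bases within the first 300 bp, and about 4% beyond 300 bp) can be turned into a rough expectation of how many miscalled bases an average 375-bp EST carries. The following is only a minimal sketch: the function name and constants are illustrative, and it assumes, beyond what the text states, that errors are independent and that each rate applies uniformly within its region.

```python
# Figures quoted in the text: 29 mismatches in 10,500 compared bases
# within the first 300 bp, and roughly 4% beyond 300 bp.
ERRORS_OBSERVED = 29
BASES_COMPARED = 10_500
RATE_FIRST_300 = ERRORS_OBSERVED / BASES_COMPARED
RATE_BEYOND_300 = 0.04
BOUNDARY_BP = 300


def expected_errors(length_bp: int) -> float:
    """Expected number of miscalled bases in a single-pass read.

    Assumption (not from the article): errors are independent and each
    rate applies uniformly within its region of the read.
    """
    head = min(length_bp, BOUNDARY_BP)      # bases within the first 300 bp
    tail = max(length_bp - BOUNDARY_BP, 0)  # bases beyond 300 bp
    return head * RATE_FIRST_300 + tail * RATE_BEYOND_300


if __name__ == "__main__":
    print(f"rate, first 300 bp: {RATE_FIRST_300:.2%}")
    print(f"expected errors in a 375-bp EST: {expected_errors(375):.1f}")
```

Under these assumptions an average EST is expected to contain about 3.8 miscalled bases, roughly 3 of which fall in the last 75 bases. This is consistent with the advice to treat the distal part of each read with caution.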
In many instances the sequence identity between the newly identified ESTs and the previously identified cDNA clones was less than 95% at the nucleotide level, indicating that the clones might represent isoforms of the previously identified gene (or, less likely, an abnormally high sequence error rate). As an example of the biological complexity underlying this apparent redundancy, the current release of dbEST contains 316 Arabidopsis ESTs with homology to known kinases. A preliminary analysis of the number of different kinases represented by this collection indicated that there are dozens of structurally different enzymes represented by this subset of ESTs. Thus, a detailed analysis of the total number of different genes represented by the ESTs is beyond the scope of this article. The list of individual clones with homology to the EST classes in Table I can be obtained from the data bases by following the instructions presented below.

Figure 1. Comparison of EST sequences obtained for catalase (A) and Chl a/b binding proteins (B). The extent of the full-length cDNA sequence for each gene is shown as long arrows representing the single full-length clone for catalase (CAT1) or the three clones available for members of the Chl a/b binding protein family (ATHLHCP3, ATHLHCP2, ATHLHCP1). The direction and extent of the ESTs obtained for each gene are shown as horizontal arrows. The positions of Met codons are shown as small boxes and the translation initiator codons are indicated by solid vertical arrows.

One hundred seventy-seven of the ESTs showed significant deduced amino acid sequence homology to 113 previously identified genes from plants other than Arabidopsis (Table II). In some instances these ESTs appeared to represent new isoforms of the previously identified plant genes. A striking example of this is represented by the isolation of seven distinct Cyt P450 sequences. In view of the relatively low deduced amino acid sequence identity between the Arabidopsis clones and the previously isolated clones from other species, it is possible that each of the corresponding proteins catalyzes a different enzymatic reaction. Thus, even though this class of ESTs corresponds at some level to previously known plant genes, the clones may prove useful starting materials for investigations of the corresponding functions in Arabidopsis. In this respect it should be noted that some of the apparently distinct EST sequences may actually represent different nonoverlapping regions from the same gene.

One hundred eighty-three of the ESTs showed significant homology to 165 previously identified genes from species other than higher plants (Table III). The sources of the homologous genes varied from bacteria to humans. Many of the ESTs showed homology to enzymes from ubiquitous metabolic pathways, structural proteins, and components of the transcriptional or translational apparatus. Others showed homology to proteins involved in functions that are not known to exist in plants. For example, ESTs were identified with apparent homologies to bovine brown fat uncoupling protein, the agglutinin core subunit from yeast, a cyclic nucleotide gated channel from catfish, a fibronectin binding protein, and many other proteins that cannot be immediately assigned probable functions in plants. Many other ESTs corresponded to proteins that might participate in known plant functions. For instance, a putative clone for an epoxide hydrolase could be involved in cutin synthesis or in carotenoid metabolism. A clone for a putative acyl-CoA binding protein could represent a new lead to the as-yet unresolved problem of how lipids move between intracellular membranes.

DISCUSSION

The importance of high-throughput cDNA sequencing resides in the fact that it is an extremely efficient way of connecting plant biology to nonplant biology. For the past 15 years, biologists who do not work on plants have been producing large amounts of sequence information about proteins and genes of known function from a wide variety of organisms. The availability of a comprehensive collection of Arabidopsis and rice ESTs will facilitate the ability of plant biologists to directly utilize this vast pool of knowledge about proteins and genes from nonplant organisms. Frequently, the products of these nonplant genes exhibit enough homology to the corresponding plant genes so that only a few dozen amino acid residues of sequence information are sufficient to identify a statistically significant match. In other cases, such as the family of Cyt P450s described here, homology between genes of related but different function can be used to identify potentially useful new genes. This example also illustrates an important caveat to the use of homology searching: genes of different function may appear homologous. Thus, results from EST analysis are essentially just hypotheses that must be tested by other criteria. Nevertheless, because of the rapidity with which the data bases of ESTs are currently growing, it is important to know how to access and use this information. The following discussion identifies some of the relevant issues in this respect.

Table I. List of previously identified Arabidopsis genes or gene families for which one or more ESTs were identified

14-3-3 protein
2S seed storage protein
RNA binding protein 31 kD
Adenylate translocator
ADP-ribosylation factor
α tubulin
Amino acid permease I
Annexin
APG protein
ATHB2 homeobox protein
Auxin-induced protein
β tubulin
Blue copper binding protein
Chl a/b binding protein
Chl a/b binding protein
Chl a/b binding protein
Calmodulin-like 22-kD protein
Carbonic anhydrase
Carboxypeptidase Y
Catalase
Cdc2
Chalcone synthase
CHL1 gene product
Cor47
CP29
Cruciferin
DNA binding protein
DRT 100 gene product
Elongation factor 1-α
Elongation factor Tu
Enolase
Ethylene-forming enzyme
Eukaryotic initiation factor 5A
Fd
Ferritin
Flavonol 4-sulfotransferase
Fru-bisP aldolase
Glutamine synthetase
Glutathione S-transferase
Glyceraldehyde-3-P dehydrogenase
Gly-rich protein
GTP binding protein
Heat-shock 70-kD cognate
Hydroxymethylglutaryl-CoA reductase
Hypothetical transmembrane protein
Ascorbate peroxidase
Laminin receptor
Leu aminopeptidase
Lti140 gene product
Meri-5
Metallothionein
Lipid transfer protein
Peroxidase
Phosphoribulokinase
PSII 10-kD protein
PSII 33-kD protein
Plasma membrane H+-ATPase
Poly(A) binding protein
Polyubiquitin
Protein kinase
Protein kinase C inhibitor
PSII 33-kD oxygen-evolving protein
Pyrophosphate-energized vacuolar proton pump
Receptor-like protein kinase
Ribosomal protein L12
Ribosomal protein L17
Ribosomal protein L19
Ribosomal protein L27
Ribosomal protein L3
Ribosomal protein L9
Ribosomal protein S13
Ribosomal protein S19
Rubisco activase
Rubisco SS 1A
Rubisco SS 1B
Rubisco SS 2B
Rubisco SS 3B
S-Adenosylmethionine synthase
Superoxide dismutase
Thaumatin
Thioglucosidase
Tonoplast intrinsic protein
Topoisomerase
Transmembrane protein
Ubiquitin
Ubiquitin-conjugating enzyme
Ubiquitin extension protein
Vacuolar ATP synthase

Gene Representation in the EST Data Bases

Based on the rate at which EST sequences are currently being produced in various laboratories, we believe that partia1 sequence information will be available for the majority of plant genes in the foreseeable future. Based on estimates of

Newman et a].

1246

Plant Physiol. Vol. 106, 1994

Table II. lnventory of Arabidopsis €STs with significant homology to genes from other plants ESTs with homology to Arabidopsis genes are not listed. The EST# is the accession number assigned by dbEST. The numbers in ttie columns designated ID, Similar, and Overlap refer to the number of identical (ID) or similar (Similar) amino acids in a contiguous region of ;I particular length (Overlap). The heading Organism refers to the source of the protein that exhibits homology to the Arabidopsis EST. In those cases where more than one EST showed sigificant homology to a particular protein, the number of "hits" is indicated by a number in parentheses in the Putative ldentification column. EST#

Putative ldentification

ID

Similar

34612 21240 35217 34874 35102 21164 34964 20833 34893 21138 20775 21577 34784 34760 34794 21055 35173 3472 1 3462 1 3471 1 21035 21310 34914 34891 21207 20901 21629 21 529 20949 20831 21232 35095 34823 35104 21239 21077 35193 35018 35046 35094 21611 21368 21405 21 126 21 597 34838 2 1489 2 1545 20791 35147 35055 34632 34857 34996 34942 21110 34625

14-3-3-like protein (2) ACC oxidase (2) Acetyl-COA carboxylase Actinidin Adenylate kinase ADP-GIu pyrophosphorylase Aleurain a-Galactosidase Annexin Anther-specific protein SF18 ATP synthase 6, mitochondrial ATP synthase 7 , mitochondrial Auxin down-regulated gene ADR11 Auxin-induced protein PCNTlO7 P-1,3-Glucanse p-Clucosidase (4) P-Ketoacyl-ACP synthase Chl alb binding Cathepsin B Chloroplast inner envelope protein Cinnamyl-alcohol dehydrogenase CP24 Chl alb binding 1OB (3) Cystatin (2) Cys synthase Cyt B6-F Cyt P450 type I Cyt P450 type II Cyt P450 type III Cyt P450 type IV Cyt P450 type V Cyt P450 type VI Cyt P450 type VI1 Dihydroflavonol-4-reductase Dihydrolipoamine dehydrogenase Early light-inducible protein Elongation factor I a (2) Endo-l,3-p glucosidase Endoplasmin (HSP 90) ENOD8 Ethylene-forming enzyme Flower senescence-related protein (6) Fru-bisP aldolase (5) Ceranylgeranyl PPi synthetase Glutamate synthase Heat-shock 70-kD, mitochondrial Hyp-rich glycoprotein Hypothetical 16.5-kD protein (4) Hypothetical protein (6) lniation factor 4A lnitiation factor 5A lsocitrate dehydrogenase (NADPH) lsopropylmalate dehydrogenase Jacalin heavy chain (2) Late embryogenesis abundant protein Lectin I Lectin I I Legumin

61 51 57 66 92 64 21 21 22 25 25 28 21 50 17 55 97 73 22 37 72 57 34 96 28 19 47 27

72 72 75 74 1 o1 74 35 28 26 33 31 30 30 57 23 70 106 85 39 45 86 59 52 1O 0 35 35 63 38 44 54 40 62 32 25 30 75 40 67 28 61 86 107 32 89 104 28 48 38 111 21 99 81 31 43 41 33 26

33

36 22 46 26 23 22 57 24 55 22 59 64 99 24 82 95 23 30 30 107 16 91 75 20 27 24 22 15

Overlap

81 98 102 1O0 116 92 75 42 41 50 37 34 63 72 32 91 126 103 67 50 110 67 77 115 61 53 91 56 64 74 71 108 45 26 49 98 58 103 5o 73 1o4 124 53 97 109 67 81 51 112 24 113 88 62 83 88 63 46

Score

Organism

316 293 294 346 469 300 85 124 99 157 141 131 115 264 93 309 494 380 106 212 395 325 189 472 104 102 240 141 190 212 114 224 140 127 1o1 308 133 275 96 316 360 489 107 420 483 107 130 163 597 83 482 386 95 121 102 106 90

Oenothera hookeri Tomato Maize Kiwi Rice Potato Barley Cyanopsis tetragoncdoba Tomato Sunflower Sweet potato Sweet potato Soybean Tobacco Brassica napus Jrifolium repens Castor Tomato Wheat Spinach Tobacco Tomato Maize Spinach Tobacco Avocado Avocado Avocado Catharanthus roseu:; Avocado Avocado Avocado Antirrhinum majus Pea Barley Rice Barley Barley Medicago sativa Brassica juncea Dian th us ca ryophyl.'us Spinach Capsicum annum Maize Pea Zea diploperennis Tobacco Strawberry Tobacco Medicago sativa Soybean Brassica napus Jackfruit Cotton Medicago truncatula Doliehos biflorus Vicia faba Continued on next page

1247

Arabidopsis cDNA Sequences

Table II. Continued EST#

34757 3491 7 21641 34716 21075 21315 35 164 351 58 21468 21365 20850 20933 34862 20842 20954 21588 34974 21072 21102 2 1589 34999 21612 21504 21256 35040 20957 21 163 21212 21151 20950 35112 21171 21490 21471 21104 35162 20915 21018 2 1542 35083 2 1474 34758 35216 35012 21609 20962 34693 21270 34935 2 1459 2 1066 34789 34759 34768 35061

Putative ldentification

ID

Similar

Overlap

Score

Lupin-specific protein PPLZO2 Major latex protein (2) Malate dehydrogenase (NADP) Malate dehydrogenase, glyoxysomal Malate synthase, glyoxysomal Malic enzyme, NADP-dependent (2) MAP kinase homolog type I MAP kinase homolog type II Membrane channel, root specific Monodehydroascorbate reductase Multiple stimulus response protein (2) Myb proten 308 Myrosinase (3) Oryzain a chain (2) Oryzain chain (2) Oryzain y chain Pathogenesis-related protein Pectate lyase Pectinesterase2 PEP carboxylase (2) Peroxidase I (3) Peroxidase II Peroxidase III Peroxidase, cationic I Peroxidase, neutra1 Phosphate translocator, chloroplast (3) Phosphoglycerate kinase Phosphoglycerate mutase (2) Pistil extensin-like protein Polygalacturonase inhibitor (2) Profilin I, pollen antigen (4) Pro-rich protein (2) PSI reaction center subunit IV PSI 20-kD protein (2) PSI subunit III (2) PSll 16-kD subunit PSll 23-kD protein (8) Putative membrane channel protein Pyruvate decarboxylase RAS-related CTP-binding protein RAS-related CTP-binding protein RAS-related GTP-binding protein (2) Receptor-like protein kinase Ribosomal proten L16 Ribosomal protein L23 Ribosomal protein L24 Rubber elongation factor S-Receptor kinase (2) S-Adenosyl-Met decarboxylase Senescence-related protein D l N l (3) Stearoyl-ACP desaturase Stem-specific protein Suc-phosphate synthase Vacuolar ATP synthase 16-kD subunit (3) Vacuolar ATP synthase 69-kD subunit

27 30 56 34 49 105 38 16

37 37 64 38 50 117 56 20 131 43 116 89 47 94 102 23 46 83 30 106 58 56 50 59 37 1O0 45 102 23 38 76 23 17 86 56 35 97 83 41 41 52 99 48 61 91 78 38 60 54 86 40 40 30 36 98

50 72 82 45 57 133 73 21 145 46 128 95 49 108 119 31 57 111 46 1o9 72 68 69 74 44 104 52 114 55 55

137 133 293 171 2 72 544 220 88 640 174 533 471 233 464 487 81 216 366 102 520 277 245 21 1 2 70 185 498 21 1 495 95 110 369 105 88 423 262 144 426 362 207 160 314 447

121

33 87 86 42 88 85 17 42 68 22 99 52 43 42 50 36 94 44 94 19 25 67 18 16 81 50 30 83 71 36 39 42 96 32 59 90 61 24 49 42 70 32 24 25 35 96

1O0

35 17 90 61 42 111 92 47 44 56 104 1O0 75 93 106 62 83 75 110 52 96 44 54 115

111

31 1 501 31 7 130 244 216 360 190 93 126 160 477

Organism Lupinus polyphyllus Papaver Sorghum Citrullus vulgaris Brassica Populus trichocarpa Pea Medicago sativa Tobacco Cucumis sativa Tobacco An tirrhinum Brassica napus Rice Rice Rice Tobacco Tobacco Tomato Sorghu m Cotton Vigna angular; Turnip Tomato Horseradish Spinach Spinach Maize Tobacco Pyrus communis Maize Brassica napus Barley Spinach Haveria trinervia Spinach Tomato Tobacco Maize Rice Pea Pea Pyrus communis Spinach Sinapis alba Pea Hevea brasiliensis Brassica Potato Radish Jojoba Tobacco Potato Avena sativa Carrot

Plant Physiol. Vol. 106, 1994

N e w m a n et al.

1248 ~

~

_____~

~

~~~

Table 111. lnventory of Arabidopsis ESTs with significant homology to nonplant genes See Table II for an explanation of the column headinns. EST#

34717 21168 21327 21470 20896 2 1287 20853 34961 20862 34824 35201 34661 21174 21148 34725 21451 34907 20952 21614 20849 21 343 35126 21606 20782 20927 34602 21521 21051 351 11 21250 34754 21019 21568 2 1242 2 1622 2 1060 21427 20865 2 1064 21350 34668 21342 34669 21515 2 1030 21109 34634 35035 34742 34945 21639 34666 21650 34752 34655 21 579 35125 34833 34989 33953 34066

Putative ldentification 26s protease subunit 4 4-Nitrophenylphosphatase a-Agglutin core subunit (Ser rich) Acyl carrier protein Acyl-COA binding protein Acyl-COA oxidase, peroxisomal (2) ADP/ATP carrier protein ADP/ATP carrier protein, mitochondrial Ala aminotransferase Alcohol dehydrogenase Aldehyde dehydrogenase type I Aldehyde dehydrogenase type II a toxin a-Glucosidase, lysosomal (3) Aminomethyltransferase Ankyrin 2 Apolipoprotein A-IV Arsenical pump-driving ATPase Aspartic acid rich proten ATP synthase B’, chloroplast ATP synthase E subunit, vacuolar ATP-binding protein Bacteriochlorophyll synthase Bactoferritin co-migratory protein P-Hydroxybutryl-COA dehydrogenase Brown fat uncoupling protein Cathepsin E Cell-division control protein Cell-division protein Chaperonin-like protein Citrate lyase Collagen-related protein 2 Cyclic nucleotide gated channel Cysteinyl-tRNA synthetase Cyt B561 Cytoplasmic protein transport (sec23) Diaminopimelate epimerase DNA repair protein RAD18 Dynamin-1 Elongation factor 2 Elongation factor 3 Elongation factor I, gamma (2) Elongation factor Tu Endoglucanase (cellulase) Epoxide hydrolase Ferripyrochelin binding protein Fibronectin binding protein (Pro rich) Galactokinase Gephyrin, microtubule-associated proten Glc derepression factor POP2 Glc transport protein Glc-6-phosphate dehydrogenase (2) Glutamate decarboxylase (2) Glutamate synthase Glutaredoxin Granaticin polyketide synthase GTP binding protein GTP binding protein GTP binding protein GTP cyclohydrolase II Hemoglobinase

ID

Similar

OverlaD

Score

Oraanism

35 23 28 38 40 69 45 27 27 27 39 24 33 29 18 34 19 23 11 37 41 20 23 21 41 17 23 88 73 77 25 16 25 27 39 60 37 14 63 91 19 30 84 25 41 24 20 33 23 13 32 51 49 22 29 25 45 23 31 52 38

47 34 53 47 54 90 68 38 39 33 43 31 44 38 21 48 34 32 12 67 56 24 29 29 60 20 34 1 o1 84 90 44 23

61 50 97 59 77 130 97 61 51 52 56 50 69 54 26 88 72 47 14 123 78 28 44 46 64 26 61 117 1o1 105 92 28 1O 0 39 113 104 79 34 117 108 45 58 116 46 99 50 51 81 52 40 105 87 95 47 59 61 64 57 41 81 69

183 125 1o1 202 206 380 235 141 152 144 196 104 185 181 105 103 86 1O0 92 147 204 1o1 85 94 282 88 110 405 383 406 81 101

Human Yeast Yeast Neurospora crassa Human Rat Rickettsia prowazekii Yeast Human Chicken Bovine Human Clostridium perfringer s Human Bovine Human Mouse E. coli Plasmodium falciparuin Synechococcus PCC6301 Manduca sexta E. coli Rhodobacter sphaero der E. coli Clostridium acetobut),licum Bovine Cavia porcellus Yeast E. coli Human Rat Hydra magnipapillata Channel catfish E. coli Bovine Yeast E. coli Yeast Rat Chlorella kessleri Yeast Artemia Thermus aquaticus Clostridium thermocdlum Human Pseudomonas aerugirtosa Staphylococcus aureiis Human Rat Yeast Synechocystis PCC6803 Yeast E. coli Azospirillum brasi1eni;e Yeast Streptomyces violace Pruber Yeast Yeast S. pombe B. subtilis Schistosoma japoniciim

49

28 64 80 51 17 85 1O0 27 39 96 34 64 37 25 46 35 27 63 65 68 27 37 35 62 30 37 60 50

94

150 158 345 180 85 328 485 94 146 447 142 265 143 97 165 112 86 122 2 70 227 91 137 126 243 113 162 258 200

.

Continued cin next page

Arabidopsis cDNA Sequences

1249

Table 111. Continued Putative ldentification

EST#

ID

Similar

Overlap

Score

Organism

~

21627 2 1636 35084 35067 21 533 21086 35072 35098 34961 21245 21 541 20970 34828 20940 21286 2 1208 34778 34763 20946 21 083

His-rich glycoprotein Histone H2A type IV (2) Histone H2A type VI Hydantoinase Hydrogenase (2) Hydroxylase Hypothetical20-kD open reading frame Hypothetical 272-kD protein Hypothetical 96-kD yeast protein YKL525 Hypothetical Pro-rich protein Hypothetical protein lnositol 1,4,5-triphosphate 5 phosphatase lsocitrate dehydrogenase (NAD+) Isopentenyl-diphosphate isomerase lsopropylmalate dehydratase KDEL receptor Keratin type II Kinase, casein type I 6 Kinase, casein type II Kinesin light chain isoform 4

38 63 40 42 20 14 19 51 27 16 45 24 36 23 38 43 36 29 23 24

53 72 48 57 33 29 25 65 38 16 56 34 45 27 55 53 38 39 35 34

98 93 66 86 61 44 37 90 61 19 84 52 70 33 102 63 74 51 50 77

214 2 70 201 219 99 83 101 31 7 141 105 223 125 190 126 170 221 177 165 113 89

21071 21098 2 1005 35034 20789 21219 34929 21 169

Lactaldehyde dehydrogenase Lactoyl glutathione lyase Lipase Malate dehydrogenase Malate dehydrogenase, cytoplasmic Mei2 gene Met synthase Met synthase Methylamine oxidase M H C class III RD-repeat protein MHC H-ZK/t-w5-linked open reading frame Microtubule-associated protein Mitosis inducer (protein kinase) MOV34, embryogenesis factor Multidrug resistance protein Multiple antibiotic resistance NADPH dehydrogenase Neurofilament protein H, form H1 Nucleolin Oxoglutarate/malate carrier protein Oxoisovalerate dehydrogenase Oxysterol binding protein (2) Paired amphipathic helix protein Pancreatic tumor-related protein Peptidyl-prolyl cis-trans isomerase Phosphogluconate dehydrogenase Phospholipase C Phosphoprotein phosphatase 2C Placenta1 protein 15 Poly-pyrimidine tract-binding protein Pre-mRNA splicing factor Prohibitin Proteasome component C3 Protein kinase C receptor Protein kinase, G2-specific Proteosome component PUP1 Quinone oxidoreductase Raf-1 proto oncogene Riboflavin synthetase Ribosomal protein HS6 Ribosomal protein LI0

17 17 19 36 22 54 53 43 49

34 25 25 40 29 69 60 52 73 39 29 43 33 88 58 30 54 40 37 31 91 25 37 39 27 31 28 28 47 26 39 79 55 66 50 54 42 20 55 53 37

50 38 36 55 40 98 93 88 101 55 40 98 65 114 70 30 128 98 79 49 110 38 73 75 47 39 37 40 69 45 56 99 69 88 74 67 65 28 71 83 57

96 85 90 169 111 285 263 189 291 135 88 87 118 352 230 153 133 107 83

21037

2 1603 20839 34926 34745 34909 21175 2 1643 21565 35063 20928 35026 2 1032 35166 21380 35003 34624 21505 21027 3481 1 2 1266 21288 35089 20904 34783 34679 2151 1 20942 20958 34596 21264 35180 34860

21

20 23 28 65 44 29 30 32 20 23 77 18 18 29 16 20 16 24 31 19 26 56 47 54 38 37 32 15 34 32 23

112

423 98 102 130 87 120 94 129 170 88 149 294 227 285 197 187 161 80 200 166 111

Plasmodium lophurae Volvox carteri Ch icken Pseudomonas putida Anabaena cylindrica Streptomyces halstedill E. coli C. elegans Yeast Owenia fusiformis C. elegans Human Yeast Yeast Phycomyces blakesleeanus Human Mouse Rat Human Strongylocentrotus purpura tus E. coli Human Rhizomucor miehei Thermus aquaticus Pig S.pombe E. coli Yeast Arthrobacter sp. Mouse Mouse Rat S.pombe Mouse Human E. coli Yeast Rabbit C hicken Human Bovine Human Yeast Human E. coli E. coli Listeria Rat Human Rat Human Human Human Rat S.pombe Yeast E. coli Human Photobacterium Haloarcula marismortui Yeast Continued on next page

1250

Plant Physiol. Vol. 106, 1994

Newman et al. ~

Table 111. Continued Putative ldentification

EST#

2 1408 35141 34777 21421 20797 21441 21 51 3 34657 21 353 34892 35058 21 553 35096 20909 21318 21113 35142 21224 2 1293 21218 21 114 21158 21397 20956 21007 21627 34904 21155 2 1644 35122 21026 21210 21360 21430 20984 34667 34957 21 387 34741 34795

Ribosomal protein L14A Ribosomal protein L3 Ribosomal protein L36 (2) Ribosomal protein L38 Ribosomal protein L4 Ribosomal protein L41 Ribosomal protein L6 Ribosomal protein L8 Ribosomal protein S2 Ribosomal protein S3 (2) Ribosomal protein S6 (2) Ribosomal protein S8 Ribosomal protein U R P l (2) Ribosomal protein YL41 RNA binding protein RNA helicase (2) RNA polymerase I suppressor protein S-Adenosyl-Met decarboxylase (2) Secl4

Ser-rich protein Single-stranded DNA binding protein Small nuclear ribonucleoprotein Splicing factor U2AF 65-kD subunit Stress-inducible protein Succinyl COA synthetase Sulfated surface glycoprotein (2) Surface glycoprotein Synaptobrevin T-cell specific protein Tetrahydrofolate synthase Thermophilic factor Thiolase (2) Transaldolase Transketolase Translation factor Sul1 Tyrosine aminotransferase Vacuolar ATPase 36-kD subunit Vacuolar sorting protein VPS35 Valosin-containingpolypeptide (CDC48) Vitelloaenin (Ser-rich)

ID

Similar

Overlap

Score

40 13 41 38 15 73 45 51 43 92 47 60 42 20 33 36 22 40 27 24 18 48 29 31 16 25 17 23 20 32 25 17 26 43 20 41 28 27 70 26

50 20 51 42 29 77 60 68 53 96 60 80 55 22 42 57 29 54 35 29 28 59 45 41 27 26 22 50 26 46 35 28 43 56 24 60 44 40 86 31

66 26 80 45 38 78 79 89 90 116 82 112 86 25 57 76 39 92 62 45 44 74 63 52 37 42 27 77 37 84 74 44 65 82 32 97 67 62 1o1 57

206 82 196 21 1 101 404 246 290 199 456 251 314 218 1o1 178 210 114 173 124 87 101 248 168 166 90 140 94 136 115 156 104 93 136 197 1o9 222 149 135 363 80

the total genome size of Arabidopsis, the average size of a gene, and the average distance between genes, it has been estimated that Arabidopsis has enough DNA to encode only about 25,000 genes at most (assuming 1 kb between adjacent genes) (Gibson and Somerville, 1993) and probably has fewer (Meyerowitz, 1994). Thus, if redundant sequencing could be avoided, one laboratory with severa1 automated sequenators could be expected to obtain sequences of cDNAs for most or a11 of the genes in Arabidopsis in about 3 years. As in other EST sequencing projects, we initially relied on randomly chosen cDNA clones as a source of ESTs. Thus, the first few thousand Arabidopsis and rice ESTs are enriched with sequences representing highly abundant cDNAs. Because moderately expressed genes tend to have a higher probability of showing homology to a known gene in the sequence data bases than weakly expressed genes (Green et al., 1993), the relatively high frequency of ESTs that exhibited

Organism

Xenopus Cyanophora paradoxa Rat Rat

Yeast Candida maltosa Cyanophora paradoxa Rat

Drosoph ila Xenopus Human Rat

Yeast Yeast Drosoph ila E. coli Yeast Rat

Yeast Plasmodium falciparurn Human

Drosophila Mouse Fusarium oxysporum lhermus aquaticus Volvox carteri Trypanosoma brucei Yeast Mouse Yeast Sulfolobus shibatae Human

Yeast E. coli Yeast Rat

Yeast Yeast Pig

Chicken

homology to a known gene may not be sustained upon further sequencing. However, during the sequencing of more than 85,000 human ESTs, there was no significant decline in the frequency with which an EST could be assigncsd putative function by comparison to the data bases (C. Venter, personal communication). The rationale for using the single cDNA library described here rather than a collection of libraries preparec! from different tissues (Hofte et al., 1993) is that in order to sequence cDNAs for a11 of the expressed genes in Arubidopsis by sequencing randomly chosen anonymous clones, it will be necessary to develop methods that limit redundan t sequencing of abundant cDNAs and simultaneously enhance the frequency of cDNAs that are expressed at very lo” levels or that are expressed only in a small number of cells or under nonambient conditions. One of the ways in which this may be accomplished is by the construction of a normalized library

in which all cDNA clones are represented with similar abundance (Sankhavaram et al., 1991). The use of the λZipLox vector and the decision to pool mRNA from various tissues rather than to sequence tissue-specific libraries were guided by considerations pertaining to library normalization. Since many examples of "tissue-specific gene expression" are quantitative rather than qualitative, effective normalization of tissue-specific libraries would be expected to eliminate much or all of the information associated with the origin of the RNA. By contrast, since many of the same genes would be present in all of the tissue-specific libraries, the use of these libraries would require the sequencing of much larger numbers of ESTs in order to sample at least 95% of all cDNAs at least once. Tissue-specific libraries do offer the advantage that the tissue of origin of a cDNA is known. However, since it is not known if this is the only tissue of expression, this information is of limited value unless a very large number of sequences are available from all tissues. In summary, we believe that it is feasible to obtain partial sequence information on all cDNAs in the foreseeable future. Thus, many of the problems currently associated with gene isolation in higher plants will increasingly become problems associated with gene identification in data bases.

Accessing the Data Bases

Although various lists of EST homologies are available (e.g. Tables I, II, and III), such lists are rapidly made obsolete because of increases in the information and sequence content of the international data bases. Most of the groups that are currently sequencing large numbers of plant ESTs deposit the sequences in international public data bases on a regular basis but may not publish summaries of the information for extended periods, if at all. Therefore, the only practical method to determine if an EST is available for a particular protein is to search the data bases directly on a regular basis.

There are two different ways to search for an EST that differ in the speed, the level of computer resources required, and the ability of the user to control the parameters of the search. Perhaps the simplest method, but also the most powerful, is to directly compare a known gene product with the translation of all six frames of all the ESTs in the public data bases. The second method is to perform a text search of a "preanalyzed" version of the EST data base.

The various steps involved in comparing an amino acid sequence with the six-frame translation of the dbEST data base are best illustrated by example. Suppose the goal is to determine if an Arabidopsis EST corresponding to the KDEL receptor is present in the data base (an EST corresponding to the KDEL receptor was, in fact, identified at an early stage of the EST project and shown to encode a functional homolog of the yeast protein [Lee et al., 1993]). The first step would be to retrieve the sequences of the various KDEL receptors known in other organisms. This could be done by sending an e-mail message with the keyword erd2 (the name of the yeast gene for the KDEL receptor) to the address: [email protected]. The body of the message could look exactly like the following example (although other optional instructions could be added and data bases other than GenBank could be searched):

    DATALIB GENBANK
    BEGIN
    ERD2

Within several minutes, the RETRIEVE server will send back a listing of the records for all accessions with the keyword erd2. It is important to note that similar information could be obtained by using many other keywords or accession numbers. For instance, the search could also be done using the words KDEL and receptor instead of the word erd2. To obtain detailed instructions for the formatting of searches and other options, send the word help to the same address.

The next step is to use one or more of the protein sequences corresponding to known KDEL receptors to search the dbEST data base by sending an e-mail message to the BLAST server at [email protected]. The BLAST algorithm is a heuristic for finding ungapped, locally optimal sequence alignments, which was developed by the NCBI at the National Library of Medicine (Altschul et al., 1990). The BLAST family of programs employs this algorithm to compare an amino acid query sequence against a protein sequence data base or a nucleotide query sequence against a nucleotide sequence data base, as well as other combinations of protein and nucleic acid. In the example below we request that the yeast erd2 amino acid sequence be compared to the six possible translations of all the nucleotide sequences in the dbEST data base. It is a testament to the power of the current generation of computers that this task requires only a few minutes. Here we have used only the yeast sequence, but in practice it would probably be best to use all known sequences from different classes of organisms. The message could be structured as follows (although other options are available):

    PROGRAM TBLASTN
    DATALIB DBEST
    BEGIN
    >TEST OF YEAST ERD2
    MNPFRILGDLSHLTSILILIHNIKTTRYIEGISFKTQTLYALVF
    ITRYLDLLTFHWVSLYNALMKIFFIVSTAYIVVLLQGSKRTNTI
    AYNEMLMHDTFKIQHLLIGSALMSVFFHHKFTFLELAWSFSVWL
    ESVAILPQLYMLSKGGKTRSLTVHYIFAMGLYRALYIPNWIWRY
    STEDKKLDKIAFFAGLLQTLLYSDFFYIYYTKVIRGKGFKLPK

The fourth line is a comment that serves to identify the search. The amino acid sequence that follows should not have any blanks or characters other than the single-letter code for the amino acids. The result of this search is a listing of the entries in dbEST that exhibit the highest degree of protein and nucleotide similarity to the query sequence. An example of the abbreviated output for this search is shown in Figure 2. The search identified two Arabidopsis ESTs with significant homology to the erd2 protein, as well as a yeast and human EST.

Query=  TEST OF YEAST ERD2
        (219 letters)

Database:  Database of Expressed Sequence Tags, Release 2.11, May 5, 1994
           34,292 sequences; 10,827,722 total letters.

                                                                     Smallest
                                                                     Poisson
                                                       Reading High  Probability
Sequences producing High-scoring Segment Pairs:          Frame Score P(N)      N

gnl|dbest|21208  cDNA Lambda-PRL2 A.thaliana Homology:...   +1  165  1.1e-17   1
gnl|dbest|34427  cDNA Lambda-PRL2 A.thaliana Homology:...   +1  165  1.1e-17   1
gnl|dbest|36310  cDNA Rice callus O.sativa Homology: s...   +2  134  1.2e-12   1
gnl|dbest|27594  cDNA Infant Brain, Bento Soares H.sapi...  +2   87  1.1e-06   1
gnl|dbest|26586  cDNA Fetal brain, Stratagene (cat #362...  -1   61  0.17      1
gnl|dbest|37898  cDNA Rice root O.sativa Homology: sp|...   +3   58  0.51      1
gnl|dbest|38740  cDNA STRATAGENE Human skeletal muscle...   +1   57  0.56      1
gnl|dbest|28850  Genomic gmbPfHB3.1, G. Roman Reddy P.f...  +1   45  0.57      2
gnl|dbest|44038  cDNA cbsPfHB3.1, Debopam Chakrabarti P...  -3   54  0.87      1
gnl|dbest|33267  cDNA Strasbourg-FA A.thaliana Homolog...   -3   53  0.93      1
gnl|dbest|19765  cDNA TEST1, Human adult Testis tissue...   +2   48  0.94      1
gnl|dbest|4042   cDNA CLONTECH cDNA library CCRF-CEM, c...  -2   37  0.97      2
gnl|dbest|29657  cDNA Ra147.1 A.thaliana Homology: gb|...   +3   53  0.97      1
gnl|dbest|29383  cDNA MHB3MA Cot8-HAP-Ft H.sapiens Ho...    +2   40  0.97      2
gnl|dbest|21810  cDNA Stratagene cDNA library Human hea...  -1   53  0.97      1
gnl|dbest|4426   cDNA CLONTECH cDNA library CCRF-CEM, c...  -3   40  0.98      2
gnl|dbest|30892  cDNA Human pancreatic islet H.sapiens...   +3   43  0.991     1

>gnl|dbest|21208 cDNA Lambda-PRL2 A.thaliana Homology: pir|A42286|A42286 human
            ERD-2-like protein, ELP-1 - Score: 221 pVal: 9.5e-27
            Length = 397

  Plus Strand HSPs:

 Score = 165 (77.7 bits), Expect = 1.1e-17, P = 1.1e-17
 Identities = 34/53 (64%), Positives = 39/53 (73%), Frame = +1

Query:     1 MNPFRILGDLSHLTSILILIHNIKTTRYIEGISFKTQTLYALVFITRYLDLLT 53
             MN FR  GD+SHL S+LIL+  I  T+   GIS KTQ LYALVF+TRYLDL T
Sbjct:    94 MNIFRFAGDMSHLISVLILLLKIYATKSCAGISLKTQELYALVFLTRYLDLFT 252

Figure 2. Example of the result of a TBLASTN search of dbEST using the amino acid sequence of the yeast erd2 gene product as a query. The output of the search has been abbreviated by omission of additional descriptive information.

The final step is to retrieve the nucleotide sequences for the relevant ESTs from the dbEST data base. Each EST has a variety of numbers associated with it that can be used to retrieve it from the data base. The simplest of these is the dbEST accession number, which we have used for the examples in Tables II and III. In the following example, we retrieved only the top two Arabidopsis ESTs. The retrieval is done by sending the following e-mail message to [email protected]:

    RPT 21208 34427

Once the sequences are retrieved, it is worthwhile to complete several additional searches. First, the nucleotide sequences should be compared to the data bases to determine if additional parts of the cDNA clone are represented in the data base. This is probably best done by appending the uninterrupted nucleotide sequence as the fifth and subsequent lines to the following message and sending it to the BLAST server at [email protected]:

    PROGRAM BLASTN
    DATALIB NR
    BEGIN
    >COMMENT

It is also useful to compare the deduced amino acid sequence encoded by the six frames of the EST to the protein data bases. In this case the uninterrupted nucleotide sequence is placed behind the following message to the BLAST server:

    PROGRAM BLASTX
    DATALIB NR
    BEGIN
    >COMMENT
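The message layouts and the six-frame comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`clean_sequence`, `build_blast_message`, `six_frame_translation`) are hypothetical, the standard genetic code is assumed because the translation table used by the servers is not stated here, and nothing in the code contacts the e-mail servers. It simply assembles the text that would be placed in a message and shows what translating a nucleotide sequence in all six reading frames involves.

```python
import itertools

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order (an assumption;
# the translation table used by the servers is not given in the text).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean_sequence(raw):
    """Remove blanks and line breaks so only uninterrupted letters remain."""
    seq = "".join(raw.split()).upper()
    if not seq.isalpha():
        raise ValueError("sequence may contain only single-letter codes")
    return seq


def build_blast_message(program, datalib, comment, sequence, width=44):
    """Lay out a message body: PROGRAM, DATALIB, BEGIN, >comment, sequence."""
    seq = clean_sequence(sequence)
    lines = [f"PROGRAM {program}", f"DATALIB {datalib}", "BEGIN", f">{comment}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)


def translate(dna):
    """Translate one reading frame; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3, -1, -2, and -3."""
    dna = clean_sequence(dna)
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For example, `six_frame_translation("ATGAATATTTTC")["+1"]` gives `MNIF`, and `build_blast_message("TBLASTN", "DBEST", "TEST OF YEAST ERD2", sequence)` produces the four header lines followed by the sequence with all blanks removed.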

On-Line Similarity Results

The simplest method to determine if an EST is available for a particular gene product is to use one of the newly developed WWW servers to perform a text search of all the dbEST records. This service exploits the fact that when each new EST is deposited in the dbEST data base, the six-frame translation and the nucleotide sequence of the EST are compared to the known or deduced protein and nucleic acid sequences of all known genes and gene products and a report

of significant homologies is appended to the EST sequence. Each time the sequence data bases are updated, the EST records are also reanalyzed and updated. The WWW servers permit one to search the entire record of all ESTs for the presence or absence of specific combinations of words. A consequence of this is that if one uses the word Arabidopsis to search dbEST, one recovers not only all the Arabidopsis sequences but also all ESTs from other organisms that show homology to an Arabidopsis gene or gene product. However, by using the word cress, only Arabidopsis ESTs are recovered because only primary Arabidopsis records contain the common name of the species (i.e. thale cress).

The WWW is an information retrieval system that provides on-line "point-and-click" access to documentation and multimedia information through hypertext links. A description of how to access the WWW is beyond the scope of this article. However, interested readers should consult a very intelligible introduction to the WWW that describes how to obtain the appropriate software and some features of some of the other molecular biology servers available on the WWW (Appel et al., 1994). To perform a search of the EST data base dbEST, open the URL HTTP://WWW.NCBI.NLM.NIH.GOV. When the NCBI top page opens, select dbEST by clicking on the hypertext link (the blue-colored word dbEST), then select search dbEST and enter the desired search conditions. For instance, to retrieve the Arabidopsis ESTs with homology to kinases, enter cress and kinase.

In addition to the archival sequence information contained in dbEST, analytical results from similarity searches performed at the University of Minnesota are available via a new WWW server. The information presented in this server represents a subset of the information being developed to support the Arabidopsis sequencing effort at Michigan State University.
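The behavior of the word search described above, in which the word Arabidopsis also recovers records from other organisms while the word cress does not, can be illustrated with a toy example. The record texts below are invented placeholders rather than real dbEST entries, and the `search` function is a hypothetical stand-in for the server's query, assumed here to require that every search word be present, ignoring case.

```python
import re

# Invented placeholder records; real dbEST reports contain far more detail.
RECORDS = {
    "est_A": "Organism: Arabidopsis thaliana (thale cress). Homology: protein kinase, yeast",
    "est_B": "Organism: Arabidopsis thaliana (thale cress). Homology: none reported",
    "est_C": "Organism: Oryza sativa (rice). Homology: Arabidopsis protein kinase",
}


def words(text):
    """Split a record into a set of lower-case words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, *terms):
    """Return the ids of records that contain every search term as a word."""
    wanted = {term.lower() for term in terms}
    return sorted(rid for rid, text in records.items() if wanted <= words(text))
```

In this toy data set, `search(RECORDS, "Arabidopsis")` also returns the rice record because its homology report mentions an Arabidopsis gene, whereas `search(RECORDS, "cress", "kinase")` returns only the Arabidopsis record with a kinase homology.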
In this data base project, raw data from the MSU Arabidopsis EST project are uploaded, prepared, and analyzed with a variety of tools, including automatic removal of vector sequences, similarity searches with BLASTN and BLASTX, and the detection and analysis of low-complexity regions of sequence. Reports from these analyses are indexed and hence are searchable by terms found in the descriptions of similarities generated by the BLAST suite of programs. A search term can therefore be entered in the field, and all clones having similarity to this search term in their descriptions will be displayed. A hypertext link from the clone name accesses a complete report for that clone, which includes information as to whether vector sequence was detected and removed, the percentage of ambiguous bases in the sequence, a summary of all hits generated by BLASTN and BLASTX (in addition to the complete reports), whether low-complexity regions were detected, and if so, the report of BLASTP reanalysis of the sequence. Links are also made from the reports to the full data base entry and related information presented by the NCBI WWW server. In addition to the textual report information, a graphical, nontext representation of these reports is currently under development, and a view of this graphical presentation will be included as a supplement to the textual information. The Arabidopsis/MN WWW server can be accessed at URL HTTP://LENTI.MED.UMN.EDU (page down and click on the Arabidopsis icon to access the EST server). This server is

at an early stage of development and is, therefore, undergoing constant modifications. New capabilities will be added with regularity, particularly in the area of new similarity tools and querying abilities.

Interpreting Scores

The BLAST programs report the degree of sequence similarity between an EST and the sequences in the public data bases by three summary numbers: the score, the expect value, and the probability (Fig. 2). The derivation of these numbers and the underlying design of the BLAST programs have been described by Altschul et al. (1990). A nontechnical summary of what the score means can be obtained by sending an e-mail message with the word help to blast@ncbi.nlm.nih.gov. In brief, the BLAST score is calculated by pairwise mapping of each amino acid (or small groups of amino acids) from a segment of the query sequence onto a gap-free segment of the subject sequence. A 20 × 20 substitution matrix is used to assign an integer value to each aligned pair of amino acids based on a model of the probability of exchanging one amino acid for another. The most widely used substitution scores are variations of the PAM matrix, which is a statistical summary of amino acid substitutions observed in groups of closely related proteins from more than 70 superfamilies of proteins (Dayhoff et al., 1979). The score is the sum of the integer values of the aligned amino acids from the region of the sequences with the highest similarity. The expect value reported for each high-scoring segment pair (HSP) is the number of times an HSP of equal or greater score is expected to occur by chance alone during the data base search. Thus, the total length of the data base figures into this estimate. The Poisson P-value for any given HSP is a function of its expected frequency of occurrence (due to chance) and the number of HSPs observed against the same data base sequence with scores at least as high. The expect and probability (P) values reported for HSPs are dependent on numerous factors, including the scoring scheme employed, the residue composition of the query sequence, an assumed residue composition for a typical data base sequence, the length of the query sequence, and the total length of the data base.
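The summary numbers described above can be illustrated with a short sketch. Two assumptions are made that are not part of the description above: the substitution values are a toy scheme (+5 for identical residues, -2 otherwise) standing in for a real PAM matrix, and the P-value is taken to be the Poisson probability of observing at least N HSPs when E are expected by chance, which for N = 1 reduces to P = 1 - exp(-E). The function names are hypothetical, and the code is not the BLAST implementation.

```python
import math


def toy_substitution(a, b):
    """Toy stand-in for a 20 x 20 matrix: +5 for identity, -2 otherwise."""
    return 5 if a == b else -2


def best_ungapped_segment(query, subject, score=toy_substitution):
    """Best-scoring contiguous run of aligned pairs (no gaps allowed).

    The two sequences are assumed to be already placed in register.
    Returns (score, start, end) with end exclusive.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (a, b) in enumerate(zip(query, subject)):
        if running <= 0:
            # A non-positive prefix can never help; restart the segment here.
            running, start = 0, i
        running += score(a, b)
        if running > best[0]:
            best = (running, start, i + 1)
    return best


def poisson_p(expect, n=1):
    """Probability of at least n chance HSPs when `expect` are expected."""
    if n == 1:
        # -expm1(-E) keeps precision for tiny E, where 1 - exp(-E) rounds to 0.
        return -math.expm1(-expect)
    below_n = sum(expect ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-expect) * below_n
```

Under these assumptions, `poisson_p(1.1e-17)` is numerically indistinguishable from 1.1e-17, which is consistent with the matching Expect and P values for the top HSP in Figure 2. Scores produced with the toy substitution values are not comparable to the scores reported by the BLAST programs.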
The value of the ESTs resides in the high frequency with which a hypothetical identity can be assigned to an EST by a comparison of the deduced amino acid sequence of the EST with the data base of known proteins. For the 1500 sequences reported here, possible homology to a known protein was obtained for more than 30% of the ESTs using a cutoff score of 80. A key question facing anyone who proposes to investigate an EST that shows similarity to a known protein is what criterion of significance should be used? Empirical studies of this question have generally indicated that BLASTX, BLASTP, or TBLASTN scores of approximately 80 or higher are worth further investigation (Pearson, 1991). However, there are many different ways to generate a score of 80. As noted by Pearson (1991), one often finds that two sequences share a region of 15 to 25 amino acids with 70% identity. This raises the question as to whether such a small region of high similarity is more significant than finding a 50-amino acid region with 35% identity. Although there are no concise rules for resolving these and related questions,

Pearson (1991) outlines some of the additional computational criteria that may be applied in the analysis of apparent similarity.

Sources of Potential Problems

Because it must be cost effective, high-throughput sequencing is inevitably prone to certain problems that demand attention by users of ESTs. When an EST clone is received from the stock center, a partial nucleotide sequence should be obtained from the ends of the clone to verify that the correct clone has been received. Since some of the tissues used for construction of the cDNA libraries were not sterile, some of the cDNA clones may be from contaminating organisms. This can be checked by using the clone to probe an Arabidopsis genomic Southern blot at high stringency. Although the overall accuracy of the sequence information is estimated to be >99% over the first 300 bp, sequencing errors are present and can cause problems in designing PCR primers. However, an analysis of the effects of sequence errors has indicated that such errors do not significantly reduce the probability that the EST sequence can be identified by BLAST searches (States, 1992). For some of the ESTs, sequence information is available from both ends of the clone. Therefore, when an EST report is obtained, it may be useful to search dbEST for possible sequence information for the other end of the clone by using a hypothetical clone ID based on the rules for clone names described in "Materials and Methods" (e.g. the other end of clone 38E4BT7 is 38E4BXP). Finally, there are sequences in the data bases that have been misclassified with respect to function. Thus, in the event of apparent homology between an EST and a protein of "known function," it is essential to critically evaluate the evidence supporting the assignment of function to the reference clone. Because the ESTs are public information, the informal mechanisms that have been used to minimize duplication of effort by different laboratories with similar interests are no longer adequate.
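Two of the checks mentioned in this section lend themselves to a quick calculation. The sketch below assumes, purely for illustration, that sequencing errors occur independently at a uniform per-base rate of 1% (the upper bound implied by an accuracy of >99%) and that a PCR primer is 20 nucleotides long; neither assumption is stated in the text. The `other_end` helper is hypothetical and encodes only the single pairing given as an example (the T7 and XP suffixes), not the full naming rules in "Materials and Methods."

```python
def error_free_probability(length, per_base_error=0.01):
    """Chance that `length` consecutive bases contain no sequencing error,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - per_base_error) ** length


def expected_errors(length, per_base_error=0.01):
    """Expected number of erroneous bases in a read of `length` bases."""
    return length * per_base_error


# Only the pairing shown in the text (38E4BT7 <-> 38E4BXP) is encoded here.
END_SUFFIXES = {"T7": "XP", "XP": "T7"}


def other_end(clone_id):
    """Guess the clone ID for the opposite end by swapping the end suffix."""
    stem, suffix = clone_id[:-2], clone_id[-2:]
    if suffix not in END_SUFFIXES:
        raise ValueError(f"unrecognized end suffix in {clone_id!r}")
    return stem + END_SUFFIXES[suffix]
```

Under these assumptions a 20-base primer chosen from an EST would be free of errors about 82% of the time, and a 300-base read would carry about three errors, which is why resequencing the clone before designing primers is worthwhile.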
Since the resources of the EST data bases are freely available, it is inevitable that different laboratories will at times simultaneously undertake projects to exploit the same genes. To minimize the frequency of needless duplication, the Arabidopsis Biological Resource Center (ABRC) maintains a data base of all requests for EST clones, including the identity of each requester, so that different groups with similar interests can make contact. This information can be obtained by logging on to the ABRC on-line data base AIMS, either directly or via a GOPHER server (for information, send a help message to inquire-aims@genesys.cps.msu.edu).

Future Directions

One implication of the EST projects is that within the foreseeable future, the sequences of most or a11 of the genes in several plants will be available in public-access data bases. Thus, much of the effort that is currently expended at cloning genes by various criteria may be obviated by methods based on data base analysis. In view of this, it would be prudent for anyone embarking on a project to isolate a new gene to first evaluate the possibility that it could be identified by

some criterion in a data base. Conversely, since we can now envision a day when all the cDNA sequences of several plants will be available in data bases, it may become increasingly worthwhile to simply pick an anonymous cDNA that is not homologous to any known sequence and to design experiments to deduce the function of the corresponding gene. Although this approach may seem radical at present, it is not fundamentally different from solving the chemical structure of a metabolite and then designing experiments to deduce the role of the metabolite. Indeed, in organisms such as yeast, where the complete sequence of the genome will soon be available, this approach has already been implemented (Oliver et al., 1992). From the preliminary results presented here and elsewhere, it is apparent that the function of approximately 70% of the Arabidopsis genes cannot currently be deduced solely by sequence analysis. Although the proportion of unidentified genes will continually decrease because of progress in identifying the function of plant and nonplant genes by other means, additional developments will be required to provide information concerning gene function. One way to add information to large numbers of ESTs is to correlate the genetic map position of the ESTs with the map locations of mutations. More than 800 genetic loci have been marked by mutation in Arabidopsis and the list of registered locus names is growing rapidly (Dennis et al., 1993). The ESTs can be genetically mapped by using one of the sets of recombinant inbred lines that are available from the Arabidopsis Stock Centers (Reiter et al., 1992; Lister and Dean, 1993) or, in some cases, by hybridizing the ESTs to genetically mapped yeast artificial chromosomes containing Arabidopsis DNA (Last et al., 1991).
Although not all of the Arabidopsis yeast artificial chromosomes have been aligned with the genetic map as yet (Hwang et al., 1991; Matallana et al., 1992), this approach avoids the necessity of identifying a polymorphism for each EST and is, therefore, suitable for large-scale mapping of all the ESTs. Indeed, the very act of hybridizing large numbers of ESTs to the yeast artificial chromosome libraries will lead to the identification of a complete set of overlapping yeast artificial chromosomes that span the genome and are anchored to the genetic map (Matallana et al., 1992).

The final component of genome technology that will be required to fully exploit the ESTs is the ability to use an EST to create a mutation that eliminates the function of the corresponding gene. The use of antisense technology is a useful step in this direction, and the development of facile new transformation techniques for Arabidopsis (Bechtold et al., 1993) has made the creation of antisense plants very simple. However, because of the limitations of antisense technology, a high priority should be placed on the development of facile methods for directed gene disruption (Somerville, 1993).

ACKNOWLEDGMENT

We thank Elliot Meyerowitz for identifying one of the sources of possible artifacts in the EST sequences.

Received July 13, 1994; accepted August 23, 1994. Copyright Clearance Center: 0032-0889/94/106/1241/15.

LITERATURE CITED

Abeles AL, Austin SL (1991) Antiparallel plasmid pairing may control P1 plasmid replication. Proc Natl Acad Sci USA 88: 9011-9015
Adams MD, Dubnick M, Kerlavage AR, Moreno R, Kelley JM, Utterback TR, Nagle JW, Fields C, Venter JC (1992) Sequence identification of 2,375 human brain genes. Nature 355: 632-634
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, Kerlavage AR, McCombie WR, Venter JC (1991) Complementary DNA sequencing: expressed sequence tags and the human genome project. Science 252: 1651-1656
Altschul SF, Gish W, Miller W, Myers EW, Lipman D (1990) Basic local alignment search tool. J Mol Biol 215: 403-410
Appel RD, Bairoch A, Hochstrasser DF (1994) A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends Biochem Sci 19: 258-260
Bechtold N, Ellis J, Pelletier G (1993) In planta Agrobacterium-mediated gene transfer by infiltration of adult Arabidopsis thaliana plants. CR Acad Sci Paris 316: 1194-1199
Boguski MS, Lowe TMJ, Tolstoshev CM (1993) dbEST database for "expressed sequence tags." Nature Genet 4: 332-333
Dayhoff MO, Schwartz RM, Orcutt BC (1979) Survey of new data and computer methods of analysis. In MO Dayhoff, ed, Atlas of Protein Sequence and Structure, Vol 5, Suppl 3. National Biomedical Research Foundation, Washington, DC, pp 1-9
Dennis L, Dean C, Flavell R, Goodman H, Koornneef M, Meyerowitz E, Shimura Y, Somerville C (1993) The multinational coordinated Arabidopsis thaliana genome research project progress report: year three. U.S. National Science Foundation Publication NSF 93-173
Gibson S, Somerville CR (1993) Isolating plant genes. Trends Biotechnol 11: 306-313
Green P, Lipman D, Hillier L, Waterston R, States D, Claverie JM (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259: 1711-1716
Höfte H, Desprez T, Amselem J, Chiapello H, Caboche M, Moisan A, Jourjon MF, Charpenteau JL, Berthomieu P, Guerrier D, Giraudat J, Quigley F, Thomas F, Yu DY, Mache R, Raynal M, Cooke R, Grellet F, Delseny M, Parmentier Y, Marcillac G, Gigot C, Fleck J, Philipps G, Axelos M, Bardet C, Tremousaygue D, Lescure B (1993) An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNAs from Arabidopsis thaliana. Plant J 4: 1051-1061
Hunkapiller T, Kaiser RJ, Koop BF, Hood L (1991) Large-scale and automated DNA sequence determination. Science 254: 59-67
Hwang I, Kohchi T, Hauge B, Goodman H, Schmidt R, Cnops G, Dean C, Gibson S, Iba K, Lemieux B, Arondel V, Danhoff L, Somerville CR (1991) Identification and map position of YAC clones comprising one third of the Arabidopsis genome. Plant J 1: 367-374
Keith CS, Hoang DO, Barret BM, Feigelman B, Nelson MC, Thai H, Baysdorfer C (1993) Partial sequence analysis of 130 randomly selected maize cDNA clones. Plant Physiol 101: 329-332
Last RL, Bissinger PH, Mahoney DJ, Radwanski ER, Fink GR (1991) Tryptophan mutants in Arabidopsis: the consequences of duplicated tryptophan synthase genes. Plant Cell 3: 345-358
Lee H, Gal S, Newman TC, Raikhel NV (1993) The Arabidopsis endoplasmic reticulum retention receptor functions in yeast. Proc Natl Acad Sci USA 90: 11433-11437
Lister C, Dean C (1993) Recombinant inbred lines for mapping RFLP and phenotypic markers in Arabidopsis thaliana. Plant J 4: 745-750
Matallana E, Bell CJ, Dunn PJ, Lu M, Ecker JR (1992) Genetic and physical linkage of the Arabidopsis genome. In C Koncz, NH Chua, J Schell, eds, Methods in Arabidopsis Research. World Scientific, Teaneck, NJ, pp 144-170
McCombie WR, Adams MD, Kelley JM, FitzGerald MG, Utterback TR, Khan M, Dubnick M, Kerlavage AR, Venter JC, Fields C (1992) Caenorhabditis elegans expressed sequence tags identify gene families and potential disease gene homologues. Nature Genet 1: 124-131
Meyerowitz E (1994) Structure and organization of the Arabidopsis thaliana nuclear genome. In E Meyerowitz, CR Somerville, eds, Arabidopsis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY (in press)
Nagy F, Kay SA, Chua NH (1988) Analysis of gene expression in transgenic plants. In SB Gelvin, RA Schilperoort, eds, Plant Molecular Biology Manual. Kluwer Academic, Boston, MA, pp B4:1-29
Oliver SG, et al. (1992) The complete sequence of yeast chromosome III. Nature 357: 40-47
Palazzolo MJ, Hamilton BA, Ding D, Martin CH, Mead DA, Mierendorf RC, Raghavan KV, Meyerowitz EM, Lipshitz HD (1990) Phage lambda cDNA cloning vectors for subtractive hybridization, fusion-protein synthesis and Cre-loxP automatic plasmid subcloning. Gene 88: 25-36
Park YS, Kwak JM, Kim YS, Lee DS, Cho MJ, Lee HH, Nam HG (1993) Generation of expressed sequence tags of random root cDNA clones of Brassica napus by single-run partial sequencing. Plant Physiol 103: 359-370
Pearson WR (1991) Identifying distantly related protein sequences. Curr Opinion Struct Biol 1: 321-326
Reiter RS, Williams JGK, Feldman KA, Rafalski JA, Tingey SV, Scolnik PA (1992) Global and local genome mapping in Arabidopsis by using recombinant inbred lines and random amplified polymorphic DNA. Proc Natl Acad Sci USA 89: 1477-1481
Sankhavaram RP, Parimoo S, Weissman SM (1991) Construction of a uniform abundance (normalized) cDNA library. Proc Natl Acad Sci USA 88: 1943-1947
Sasaki T, Song J, Koga-Ban Y, Matsui E, Fang F, Higo H, Nagasaki H, Hori M, Miya M, Murayama-Kayano E, Takiguchi T, Takasuga A, Niki T, Ishimaru K, Ikeda H, Yamamoto Y, Mukai Y, Ohta I, Miyadera N, Havukkala I, Minobe Y (1994) Toward cataloguing all rice genes: large-scale sequencing of randomly chosen rice cDNAs from a callus cDNA library. Plant J 6: 615-624
Somerville CR (1993) New opportunities to dissect and manipulate plant processes. Proc R Soc Lond B 339: 199-206
States DJ (1992) Molecular sequence accuracy: analysing imperfect data. Trends Genet 8: 52-55
Uchimiya H, Kidou S, Shimazaki T, Takamatsu S, Hashimoto H, Nishi R, Aotsuka S, Matsubayashi Y, Kidou N, Umeda M, Kato A (1992) Random sequencing of cDNA libraries reveals a variety of expressed genes in cultured cells of rice (Oryza sativa L.). Plant J 2: 1005-1009
Waterston R, Martin C, Craxton M, Coulson A, Hillier L, Durbin R, Green P, Shownkeen R, Halloran N, Metzstein M, Hawkins T, Wilson R, Berks M, Du Z, Thomas K, Thierry-Mieg J, Sulston J (1992) A survey of expressed genes in Caenorhabditis elegans. Nature Genet 1: 114-123