RESEARCH ARTICLE
Nucleotide Sequence of Yellow Fever Virus: Implications for Flavivirus Gene Expression and Evolution
Charles M. Rice, Edith M. Lenches, Sean R. Eddy, Se Jung Shin, Rebecca L. Sheets, James H. Strauss
Abstract. The sequence of the entire RNA genome of the type flavivirus, yellow fever virus, has been obtained. Inspection of this sequence reveals a single long open reading frame of 10,233 nucleotides, which could encode a polypeptide of 3411 amino acids. The structural proteins are found within the amino-terminal 780 residues of this polyprotein; the remainder of the open reading frame consists of nonstructural viral polypeptides. This genome organization implies that mature viral proteins are produced by posttranslational cleavage of a polyprotein precursor and has implications for flavivirus RNA replication and for the evolutionary relation of this virus family to other RNA viruses.
C. M. Rice, E. M. Lenches, and J. H. Strauss are members of the Division of Biology, California Institute of Technology, Pasadena 91125. S. R. Eddy and S. J. Shin are students at the California Institute of Technology and R. L. Sheets is doing graduate work in the Department of Cellular, Viral and Molecular Biology, University of Utah, University Medical Center, Salt Lake City 84132.
The Flavivirus genus, family Flaviviridae, consists of a group of some 70 closely related human or veterinary pathogens causing many serious illnesses, including dengue fever, Japanese encephalitis, St. Louis encephalitis, Murray Valley encephalitis, tick-borne encephalitis, and yellow fever (1). Most flaviviruses are transmitted to vertebrate hosts by blood-sucking arthropods, mosquitoes or ticks, although some evidently lack an arthropod vector (2). Arthropod-transmitted flaviviruses replicate in the arthropod host as well as the vertebrate host. Human flavivirus diseases have diverse and complex pathologies and different viruses exhibit marked tissue tropisms. Many are neurotropic, causing encephalitic symptoms; others, such as the dengue group, replicate preferentially in host macrophages, whereas yellow fever is usually viscerotropic.
The disease known as yellow fever has been recognized for several hundred years (3, 4). Until the early 1900's, recurrent epidemics occurred in the Caribbean area, which caused great human suffering and had a profound influence on human activities in the area. From its focus in the Caribbean, epidemic yellow fever was spread by ship to ports as far north as Boston and as far east as England, where mortality rates in an epidemic could exceed 20 percent of those contracting the disease. Walter Reed and colleagues in pioneering studies in Cuba in 1900 demonstrated that yellow fever is transmitted by mosquitoes, and 2 years later showed that the disease agent is filterable (5). With the recognition that the mosquito Aedes aegypti is the vector for urban yellow fever, mosquito control measures rapidly led to the elimination of urban yellow fever. Subsequently, a safe and effective attenuated vaccine strain (17D) was developed by in vitro passage of the virulent Asibi strain in chicken embryo tissue (6). However, the virus persists in a sylvan cycle in the forests of South America and Africa, transmitted by numerous mosquito species including those of the genus Haemagogus in South America and of the genus Aedes in Africa. The vertebrate hosts in this cycle appear to be almost exclusively primates, demonstrating the limited natural host range of yellow fever. From the sylvan cycle periodic outbreaks in neighboring human populations have arisen on both continents. Furthermore, since Aedes aegypti is widespread in the world, a situation exacerbated by relaxation of mosquito abatement procedures in the Caribbean and elsewhere, the potential exists for future epidemics of urban yellow fever.
Previous studies have shown that flaviviruses contain single-stranded infectious RNA (thus defining them as plus-stranded RNA viruses in which the virion RNA serves as a messenger) encapsidated in a nucleocapsid possessing icosahedral symmetry and containing a single species of capsid protein [C, apparent mass of about 14 kilodaltons (kD)]. This in turn is surrounded by a lipid bilayer containing an envelope protein (E; about 50 to 60 kD) that is usually but not invariably glycosylated (7) and a second, nonglycosylated protein (M; about 8 kD) (8, 9). How the envelope is obtained is unclear, as budding flaviviruses are seldom identified in electron microscopic studies, although maturation does appear to occur in association with intracellular membranes (9, 10). Replication of flaviviruses in tissue culture is slow, with a long latent period, and only moderate titers of virus are produced. Host cell protein and RNA synthesis are shut off only poorly (vertebrate cells) or not at all (mosquito cells), making study of flavivirus replication and structure somewhat more difficult. Virus-specific protein synthesis appears to be associated with the rough endoplasmic reticulum, and RNA replication is localized in the perinuclear region (11). No subgenomic RNA has been detected in cells infected with flaviviruses, and it is believed that the genomic length RNA, which is capped but not polyadenylated (12, 13), is the only messenger RNA (mRNA) species (9, 12, 14). This mRNA is translated into the three structural proteins and several nonstructural proteins. Translation of the flavivirus genome in vitro produces polypeptides related to the structural proteins (15) which, in the presence of appropriate membrane fractions, can be processed efficiently to yield C and E (16). Peptide mapping of in vitro translation products as well as selective incorporation of N-formylmethionine suggest that initiation in vitro occurs only with the capsid protein.
Alternatively, studies on the in vivo translation of flavivirus Kunjin have been based on the use of pactamycin or high salt inhibition of translation initiation (17) or ultraviolet inactivation of translation (18) in an attempt to map the genome order of flavivirus proteins on the assumption that there is just a single site for initiation of translation. These experiments have led Westaway and collaborators to suggest that multiple independent translation initiation sites are used within flavivirus RNA, a situation not typically found with other eukaryotic mRNA's (19). We now present the complete nucleotide sequence of the yellow fever genome determined from complementary DNA (cDNA) clones of the 17D vaccine strain. Together with recent NH2-terminal sequence analysis of both structural (20) and some nonstructural yellow fever proteins, the amino acid sequences of the encoded proteins have been deduced and a preliminary picture of flavivirus gene organization and expression has begun to emerge.
SCIENCE, VOL. 229, 23 AUGUST 1985
Sequence of yellow fever RNA. The complete sequence of yellow fever RNA is shown in Fig. 1. The 5'- and 3'-terminal sequences presented were derived from several independent clones, are homologous to the 5' and 3' termini of West Nile flavivirus genomic RNA (21) (see below), and thus probably reflect the extreme ends of the yellow fever genome. Given these assumptions, the RNA genome is 10,862 nucleotides in length and has a mass of 3.75 × 10^6 daltons (expressed as the sodium form). Previous reports have shown that flavivirus genomic RNA contains a type 1 cap at the 5' terminus but lacks a polyadenylate tract at the 3' terminus (12, 13). The base composition of the RNA is 27.3 percent A, 23.0 percent U, 28.4 percent G, and 21.3 percent C.
It is striking that the RNA contains an extremely long open reading frame, which spans virtually the entire length of the genome. This open reading frame, beginning from the first AUG triplet, is 10,233 nucleotides in length, terminating with a single opal codon (UGA), and could encode a polypeptide of 380,763 daltons, leaving 5'- and 3'-noncoding regions of 118 and 511 nucleotides, respectively. Examination of the remaining five possible reading frames (two in the virion RNA and three in the complementary RNA) reveals multiple stop codons in every case, with the longest possible other open reading frame being 804 nucleotides (in the complementary strand). Thus there is no reason to expect that any protein is translated from yellow fever RNA other than the polyprotein encoded by the long open reading frame shown in Fig. 1.
The structural proteins of yellow fever virus. The start points of the three yellow fever virus structural proteins (C, M, and E) have been positioned within the translated RNA sequence from NH2-terminal amino acid sequences obtained for the structural proteins isolated from yellow fever virions (20) (Fig. 1). The capsid protein is the first protein found in the long open reading frame and begins one residue past the first methionine. Thus, in agreement with in vitro translation data from the flavivirus genomic RNA's of tick-borne encephalitis virus, West Nile virus, and Kunjin virus (15, 16), the translation of the yellow fever genome initiates with the capsid protein, and the NH2-terminal methionine is removed during maturation of the protein (20). The capsid protein may be released from the precursor polyprotein by cleavage at or just past a series of basic amino acids (Figs. 1 and 2). From this deduced amino acid sequence, the capsid protein is quite basic, containing about 25 percent lysine and arginine distributed throughout the protein. The capsid protein of tick-borne encephalitis virus contains a similar proportion of basic amino acids (22). Since the capsid protein forms complexes with the RNA, its highly basic character probably acts to neutralize some of the RNA charges in such a compact structure.
[Fig. 2 diagram not reproduced. Legible labels: yellow fever 17D genome (10,862 nt) with a capped 5' untranslated region, a translated region, and a 3' untranslated region of 511 nt; cotranslational processing (viral proteases? signalase?); proteins C, prM, E, NS1, ns2a, ns2b, NS3, ns4a, ns4b, and NS5; cleavage of prM to M (Golgi protease?).]
Fig. 2. Organization and processing of proteins encoded by the yellow fever genome. Untranslated regions are shown as single lines and the translated region as an open box. The open triangle is the initiation codon (AUG); the solid diamond the termination codon (UGA). The protein nomenclature is described in Table 1 and (35). The single letter amino acid code is used for sequences flanking assigned cleavage sites (solid lines). Two other potential cleavage sites are shown as dotted lines. Structural proteins, identified nonstructural proteins, and hypothesized nonstructural proteins (see text) are indicated by solid, open, and hatched boxes, respectively. Other potential cleavage sites have been found and are described in Table 1, footnote asterisk.
[Fig. 1 sequence listing (genomic RNA sequence with the deduced polyprotein translation) is not legible in this copy and is not reproduced; its caption follows.]
Fig. 1 (preceding page and opposite page). Entire sequence of the genome of yellow fever virus. Yellow fever virus, 17D vaccine strain, was obtained from the American Type Culture Collection. This sample represents in vitro passage 234 of the line originated by Theiler and colleagues who started with the virulent Asibi strain (6). After plaque purification in Vero cells and amplification in BHK cells, the virus was grown in SW13 monolayers (50) and purified by polyethylene glycol precipitation, in glycerol-tartrate gradients. The purified virus was diluted with aqueous buffer and sedimented in the ultracentrifuge; the RNA was isolated by phenol extraction (51). Briefly, single-stranded cDNA was synthesized with avian myeloblastosis virus reverse transcriptase using degraded calf thymus DNA for priming (47). Second strand synthesis was carried out essentially as previously described (52). After methylation of the Eco RI sites with Eco RI methylase, phosphorylated Eco RI linkers were added with T4 DNA ligase. Following complete digestion with Eco RI, the double-stranded cDNA was sized on an agarose gel and selected size fractions were inserted into the Eco RI site of a plasmid vector derived from pBR322. Colonies containing yellow fever-specific inserts were selected by colony hybridization and were characterized by restriction mapping to obtain clones which represented most of the yellow fever genome.
Clones containing the 3' end of the genome were constructed by poly(A)-tailing (polyadenylation) the genomic RNA with Escherichia coli poly(A) polymerase followed by synthesis of double-stranded cDNA with an oligo(dT) primer. Addition of the poly(A) tract was relatively inefficient, but after digestion of the double-stranded cDNA with Bgl I, 3'-terminal Bgl I fragments were selectively cloned with a plasmid vector derived from cloned yellow fever DNA (51). Clones containing the 5' end of the genome were constructed by primer extension followed by oligo(dC) tailing with terminal deoxynucleotidyl transferase and oligo(dG)-primed second strand synthesis. The entire sequence was obtained by chemical sequencing of both strands of the DNA (53). In addition, sequence was obtained throughout from at least two clones. Wherever the sequence differed between two clones (due presumably to heterogeneity in the RNA population or errors introduced during cloning), a third and occasionally a fourth clone was sequenced in this area, and the preferred nucleotide is reported here. Nucleotides are numbered from the 5' terminus. Amino acids are numbered from the first methionine in the polyprotein sequence. The beginning of each protein is labeled (see Table 1 and text for nomenclature); tentative assignments are indicated by dashed arrows. Putative hydrophobic membrane-associated segments in the structural region are overlined. Potential N-linked glycosylation sites are denoted by an asterisk. The region of NS5 homologous to other RNA viruses (see text) is enclosed by brackets and the conserved Gly-Asp-Asp sequence is boxed. Repeated nucleotide sequences are underlined. Closely spaced in-phase stop codons that terminate the long open reading frame are boxed.
The single letter abbreviations for the amino acid residues are: A, alanine; C, cysteine; D, aspartic acid; E, glutamic acid; F, phenylalanine; G, glycine; H, histidine; I, isoleucine; K, lysine; L, leucine; M, methionine; N, asparagine; P, proline; Q, glutamine; R, arginine; S, serine; T, threonine; V, valine; W, tryptophan; Y, tyrosine.
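The genome figures quoted in the text can be cross-checked against one another with simple arithmetic. The following Python sketch is illustrative only: the lengths, polyprotein mass, and base composition are the values stated in the text, whereas the per-residue nucleotide masses and the sodium correction are approximate textbook values assumed here, not values given by the authors. The function names are hypothetical.

```python
"""Cross-check of the genome figures quoted in the text (illustrative sketch)."""

# Values quoted in the article text.
GENOME_NT = 10_862
FIVE_PRIME_NONCODING_NT = 118
ORF_NT = 10_233
THREE_PRIME_NONCODING_NT = 511
POLYPROTEIN_DALTONS = 380_763
BASE_PERCENT = {"A": 27.3, "U": 23.0, "G": 28.4, "C": 21.3}

# ASSUMPTION (not from the article): approximate masses in daltons of
# nucleotide residues within an RNA chain (nucleoside monophosphate minus water).
RESIDUE_DALTONS = {"A": 329.2, "U": 306.2, "G": 345.2, "C": 305.2}
# ASSUMPTION: in the sodium form, one Na+ replaces one proton per phosphate.
SODIUM_MINUS_PROTON_DALTONS = 22.0


def codon_count(orf_nt: int) -> int:
    """Number of codons in an open reading frame of the given length."""
    if orf_nt % 3 != 0:
        raise ValueError("open reading frame length is not a multiple of 3")
    return orf_nt // 3


def estimated_rna_mass(length_nt: int, base_percent: dict) -> float:
    """Rough mass of an RNA (sodium form) from its length and base composition.

    The 5' cap and chain termini are ignored, so this is only an estimate.
    """
    mean_residue = sum(
        (percent / 100.0) * (RESIDUE_DALTONS[base] + SODIUM_MINUS_PROTON_DALTONS)
        for base, percent in base_percent.items()
    )
    return length_nt * mean_residue


if __name__ == "__main__":
    residues = codon_count(ORF_NT)
    print("codons in the long open reading frame:", residues)
    print("sum of the three regions (nt):",
          FIVE_PRIME_NONCODING_NT + ORF_NT + THREE_PRIME_NONCODING_NT)
    print("mean residue mass of the polyprotein (daltons):",
          round(POLYPROTEIN_DALTONS / residues, 1))
    print("estimated RNA mass (daltons):",
          round(estimated_rna_mass(GENOME_NT, BASE_PERCENT)))
```

Under these assumptions the three regions add up to the stated genome length, the codon count matches the 3411 residues given in the abstract, and the estimated RNA mass falls close to the quoted 3.75 × 10^6 daltons.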
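The survey of reading frames described in the text (three frames in the virion RNA and three in the complementary RNA) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: it assumes an open reading frame runs from an AUG to the next in-frame stop codon, reports its length without the stop codon, and ignores frames that run off the end of the sequence without a stop. The function names and the short test sequence are hypothetical; the yellow fever sequence itself is not reproduced here.

```python
"""Six-frame survey for the longest AUG-to-stop open reading frame (sketch)."""

STOP_CODONS = {"UAA", "UAG", "UGA"}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Complementary strand, written 5' to 3'. Expects uppercase A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(rna: str, frame: int) -> int:
    """Length in nucleotides of the longest AUG-to-stop run in one frame.

    ASSUMPTION: an ORF starts at AUG and ends just before the next in-frame
    stop codon; runs that never reach a stop codon are not counted.
    """
    best = 0
    start = None
    for i in range(frame, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if start is None and codon == "AUG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            best = max(best, i - start)
            start = None
    return best


def six_frame_survey(rna: str) -> dict:
    """Longest ORF length for each of the six frames, keyed by (strand, frame)."""
    strands = {"+": rna, "-": reverse_complement(rna)}
    return {
        (strand, frame): longest_orf_in_frame(seq, frame)
        for strand, seq in strands.items()
        for frame in range(3)
    }


if __name__ == "__main__":
    # Hypothetical toy sequence: one short ORF in frame 2 of the plus strand.
    toy = "GG" + "AUG" + "GCU" * 4 + "UGA" + "C"
    for key, length in sorted(six_frame_survey(toy).items()):
        print(key, length)
```

Applied to a full genome, the same survey would report one dominant frame and only short open reading frames elsewhere, which is the pattern the text describes.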