SUPPLEMENTARY MATERIALS MASSIVE FACTORIAL DESIGN UNTANGLES CODING SEQUENCES DETERMINANTS OF TRANSLATION EFFICACY

Guillaume Cambray, Joao C. Guimaraes & Adam Paul Arkin

Supplementary Figures
  Figure S1 – Definition of sequence properties of interest from the analysis of natural sequences
  Figure S2 – Sequence logos of replicate factorial series
  Figure S3 – Measurements of protein production under non-coupling conditions and relationship to design factors
  Figure S4 – Properties of the inducible translational coupling device and measurements of protein production under coupling conditions
  Figure S5 – Increased codon adaptation directs improved protein production when translation initiation is not limiting
  Figure S7 – Examples of comparable structure profiles leading to different protein productions

Material and Methods
  Genomic analysis
  Sequence design
  DNA synthesis
  Plasmid construction
    Accessory plasmid
    Reporter plasmid
  Library construction
  Measurement of protein production
    Growth conditions
    Low throughput measurement of a reference panel
    Sorting of the population into fluorescence bins
    Preparation of sequencing libraries
    Processing of FACS-SEQ data
  Data processing, management and analysis
    Description of the consolidated dataset

References



Cambray et al. - Massive Factorial Design Untangles Coding Sequences Determinants of Translation Efficacy –Sup. Materials 1 of 18



SUPPLEMENTARY FIGURES

Figure S1 – Definition of sequence properties of interest from the analysis of natural sequences






All analyses used protein abundance data from Taniguchi et al. (2010; n=575) and the genome sequence of E. coli MG1655 (GI:48994873).

(A) Nucleotide composition biases in coding sequences are related to protein expression. Pearson correlation coefficients between various nucleotide contents and measured protein abundances are plotted for windows of varying sizes and positions, as shown. Colors mark different nucleotide combinations (see bottom right legend). Grey background shadings separate subpanels that correspond to increasing starting positions of the windows (see numbering below bottom panel). Within subpanels, consecutive points correspond to an increase of the window size by one codon from a fixed starting position. Within each window, the three within-codon positions have been analyzed separately, as indicated. Due to the genetic code’s redundancy, the third codon position is less constrained and should provide a less biased indication of nucleotide influences on protein production. These data highlight the contribution of AT content (see panels B and C), as previously noted by Allert et al. (2010). The strongest correlations are seen at the second codon position for %A, %T and %C, but not %G. According to Sjöström and Wold (1985), this particular pattern strongly suggests a contribution of the hydropathic properties of the underlying amino acids (see panels D and E).

(B) Scatter plot of protein abundances against AT content in the window +4 to +21 used for further design (%AT). The correlation is weak over all codon positions, but stronger when only the third codon position is considered (see A).

(C) Distribution of %AT binned by categories of protein abundances, as shown. No striking pattern differentiates the distributions. A single threshold was chosen for the discretization of this property into 2 ordinal levels, as indicated by the white line. It corresponds to the average %AT over all natural coding sequences in the reference E. coli genome.

(D) Hydropathy is correlated with protein expression. The red line shows the mean hydropathy index (MHI) over a sliding window of 11 amino acids. The blue line shows the corresponding correlations with protein abundances. Positions are given in amino acids. The grey vertical line marks the window chosen for design of the MHI property.

(E) Distribution of MHI binned by categories of protein abundances, as shown. The low protein bin has a clear bimodal distribution. Two thresholds for the discretization of this property into 3 ordinal levels are indicated in white and correspond to the 15th and 75th percentiles of MHI over all natural coding sequences in the reference E. coli genome.

(F) Scatter plot of protein abundance against CAI of whole coding sequences. The regression line is shown in red. Grey background shadings mark the 20th and 80th percentiles of protein abundances used for categorization in the distributions (see G).

(G) Distribution of CAI binned by categories of protein abundances, as shown. Two thresholds for the discretization of this property into 3 ordinal levels are indicated in white and correspond to the 15th and 85th percentiles of CAI over all natural coding sequences in the reference E. coli genome.

(H) Distribution of codon ramp properties binned by categories of protein abundances, as shown. Plotted are absolute bottleneck positions (BtlP, left) and bottleneck relative strengths (BtlS, middle) for all natural coding sequences in the reference E. coli genome. The distribution of BtlS for sequences with BtlP upstream of codon 33 (the design threshold dictated by construction constraints; see panels I and J) is shown on the right. This latter plot guided the definition of a nested threshold for BtlS, corresponding to the 70th percentile for this property.

(I) Engineering the codon ramp bottleneck in the sfGFP reporter. The profile of relative bottleneck strength for the original sfGFP reporter is shown in grey (sliding window of 20 codons). To engineer conditions wherein the variable sequence fused to the reporter could influence bottleneck properties, 22 codons clustered in 3 different regions of the reporter sequence were mutated. The resulting profile is shown as a bold green line and features a strong C-terminal bottleneck at position 232 and a moderate bottleneck at the beginning of the reporter. The strength of the latter can be modulated by the nature of the upstream designed sequence (see J).

(J) Possible bottlenecks in the engineered reporter. Shown is a scatter plot of bottleneck positions and strengths realized for a million random sequences of 32 codons fused to the engineered reporter. Bottleneck positions are contained within the first 33 codons or fall at position 232, as intended. The red line shows the chosen nested threshold for BtlS, which is not exceeded by C-terminal bottlenecks.

(K) Smooth variations in secondary structure strength around the start codon of natural coding sequences. Shown are boxplots of predicted minimum free energy for a window of 60 nts slid by steps of 5 nts around the start codon. Colored boxes highlight the windows chosen for design. Background shadings mark the 10th, 25th, 50th, 75th and 90th structure percentiles for randomly generated sequences. Structures in 5’UTRs tend to be less stable than expected by chance, while structures within genes instead tend to be more stable.

(L) Distribution of the structures’ predicted minimum free energies binned by categories of protein abundances, as shown. Two thresholds for the discretization of these properties into 3 ordinal levels are indicated in white and correspond to the 25th and 75th percentiles of the properties over all natural coding sequences in the reference E. coli genome.
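The percentile-based discretization described in panels E, G and L can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `discretize` and the uniformly distributed stand-in scores are hypothetical and are not taken from the authors' D-Tailor modules.

```python
import numpy as np

def discretize(scores, lower_pct, upper_pct, reference=None):
    """Map continuous scores to ordinal levels 1, 2 or 3.

    The two thresholds are percentiles of a reference distribution
    (here assumed to be the scores of natural coding sequences).
    """
    ref = np.asarray(scores if reference is None else reference, dtype=float)
    thresholds = np.percentile(ref, [lower_pct, upper_pct])
    # np.digitize gives 0 below the first threshold, 1 in between, 2 above.
    levels = np.digitize(np.asarray(scores, dtype=float), thresholds) + 1
    return levels, thresholds

# Hypothetical stand-in for CAI values of natural coding sequences.
rng = np.random.default_rng(0)
natural_cai = rng.uniform(0.2, 0.8, size=1000)

# 15th and 85th percentiles, as stated for CAI in panel G.
cai_levels, cai_thresholds = discretize(natural_cai, 15, 85)
```

Scores of designed sequences would be passed as `scores`, with the natural-sequence scores supplied as `reference`, so that the thresholds stay anchored to the genome-wide distribution.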




Figure S2 – Sequence logos of replicate factorial series

The 96 positions of the designed sequences are shown as a sequence logo for each of the 56 independent factorial series. Series identification numbers are shown on top. At each position, bases are arranged by decreasing frequency from top to bottom, with sizes proportional to their frequency. Histograms on the side of each logo show the distribution of pairwise differences between sequences in the series at the nucleotide (pink) and amino acid (blue) levels. The consensus sequence is distinctly different for each series, as intended by design. Variations are well distributed all over the sequence, with some positions more variable than others. In contrast with nucleotide differences, the distribution of pairwise amino acid differences is often multimodal. This behavior stems from the initial enforcement and eventual relaxation of constraints to favor synonymous mutations during the design process. A sizeable number of sequence variants within each series are synonymous.



Figure S3 – Measurements of protein production under non-coupling conditions and relationship to design factors



(A) Bulk high-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of flow-cytometry data measured on individual cultures versus FACS-Seq raw data under non-coupling conditions. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310). The red line is a linear regression fit, excluding outliers (grey data points). Although the benchtop flow cytometer used for individual measurements is less sensitive than the more sophisticated FACS machines used for the high-throughput experiments, we find excellent agreement between the two types of data (r=0.95). Most outliers show a large standard deviation and probably correspond to the rise of variants mutated outside of the sequenced region in either assay.

(B) High-throughput measurements of protein production are highly reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates. Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density. Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Material and Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is excellent (r=0.99 on average; individual correlation coefficients as shown).

(C) Sizeable design error in the molecular Design of Experiment. Shown are the cumulative distributions of the coefficients of variation in PNC amongst experimental replicates (red) and amongst the 3 close design replicates within each series (sequences 1-4 nts apart with identical factorial properties; blue). The design error is distinctly larger than the experimental error, testifying to the inability of the factorial categorization to fully capture functional variations between highly related sequences.

(D) Factorial series characterized by higher phenotypic diversity tend to be better captured by design factors. The series-wise mean (left scatter plot) and variance (middle scatter plot) in PNC are plotted against the explanatory power (R2) achieved by all design factors and their second-order interactions in ANOVA. Red lines show linear regression fits (correlation coefficients as shown). Higher mean protein production is associated with a lower contribution of the design factors to the observed variance. In contrast, higher variance is associated with higher explanatory power of the design factors. Mean protein production and variance are moderately correlated (right scatter plot). Series that are not well explained by the design factors are those that fail to implement the intended phenotypic variability. In particular, an overly high mean protein production is likely symptomatic of a failure to design functionally relevant secondary structure.

(E) Series-wise decomposition of explainable variance. Same plot as Figure 3C but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on discretized scores. Series order and color scheme are maintained for comparison.

(F) Multiple linear regressions and ANOVA yield comparable results. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs. Left: total explanatory powers; Right: effect sizes for each design property and their second-order interactions (log scale). The largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.

(G) Slight enrichment of codon-adapted sequences amongst the highest protein producers. Shown is a scatter plot of CAI versus PNC, with data points colored by STR-30:+30, as shown. Dark lines represent the quartiles of CAI values for every percentile of PNC. Grey lines show the same quantities calculated over the whole library. Blue and red lines show linear regressions using data below and above the top quintile of protein producers, respectively (correlation coefficients as shown).

(H) Manipulation of the codon ramp does not impact protein production. Left: Distribution of protein production according to the predicted position (BtlP) and strength (BtlS) of the translation bottleneck. Boxplots over light grey background show distributions by amino acid position along the designed sequence. At each position, the blue box represents lower-level BtlS strains, while the red box represents higher-level BtlS strains. Box widths are quadratically related to the sample size. No systematic trend is apparent across these N-terminal positions. The two colored boxplots on medium grey background show pooled data across all N-terminal positions, broken down by BtlS level. The two grey boxplots show distributions binned by BtlP level. We observe no differences in protein production between these groups. Right: Scatter plot of BtlS versus PNC for N-terminal (black) and C-terminal (red) levels of BtlP. Grey and dark red lines show the corresponding quartiles for each percentile of PNC. Unlike CAI, the strength of the codon ramp does not correlate with variation in the translation regime.

(I) The effect of STR-30:+30 masks that of other properties. Shown is a scatter plot of the effect size of STR-30:+30 against the summed effect size of all other properties and second-order interactions for each replicate factorial series. Stronger contributions of the initiation-controlling STR-30:+30 diminish the global impact of other properties.
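The comparison in panel C rests on per-group coefficients of variation and their cumulative distributions. The sketch below shows one way to compute both, assuming NumPy and a sample standard deviation (ddof=1); the caption does not state which estimator was used, and the function names and toy measurements are hypothetical.

```python
import numpy as np

def coefficient_of_variation(values, axis=1):
    """Standard deviation divided by mean, computed per row by default.

    ddof=1 (sample standard deviation) is an assumption of this sketch.
    """
    values = np.asarray(values, dtype=float)
    return values.std(axis=axis, ddof=1) / values.mean(axis=axis)

def empirical_cdf(x):
    """Return sorted values and the cumulative fraction at each value."""
    x = np.sort(np.asarray(x, dtype=float))
    return x, np.arange(1, x.size + 1) / x.size

# Toy data: one row per sequence, one column per experimental replicate.
experimental = np.array([[10.0, 11.0, 9.0, 10.0],
                         [50.0, 52.0, 49.0, 51.0]])
# Toy data: one row per design point, one column per close design replicate.
design = np.array([[10.0, 18.0, 6.0],
                   [50.0, 30.0, 65.0]])

cv_experimental = coefficient_of_variation(experimental)
cv_design = coefficient_of_variation(design)
cdf_x, cdf_y = empirical_cdf(cv_design)
```

In this toy example the design replicates vary more than the experimental replicates, which is the qualitative pattern reported under non-coupling conditions.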




Figure S4 – Properties of the inducible translational coupling device and measurements of protein production under coupling conditions

(A) Influence of amber number and position on translational coupling inducibility. Population average fluorescence signals were measured by flow cytometry at mid-exponential growth under increasing dilution of unnatural amino acid (AcF). The position and number of amber stop codons were varied in a development version of the reporter system showing poor translation in the absence of coupling. Points and shaded backgrounds mark the means and standard deviations from 3 biological replicates (color as shown). The construct pGC4470, which bears a single amber at the fifth codon of the leader sequence, provides greater induction though slightly lower repression (green line). Since ribosomes terminating at this position show minimal interference with STR-30:+30 (Figure 6A), this version of the device was retained for the final reporter.

(B) Inducible translational coupling enables quantitative control of translation rate. Distributions of fluorescence measured by flow cytometry under increasing dilution of AcF (color as shown) for construct pGC4470 (green in panel A).

(C) Inducing the unnatural suppressor system recapitulates the effect of sense and stop codons. The amber stop codon (TAG) was replaced by ochre (TAA) and by sense point-mutants (AAG, TAC and TTG) in the context of 10 reporter variants differing in sequence over the first 10 codons after the start codon. Variants show different expression patterns and are shown in order of increasing expression ratio (full over no induction). In the absence of AcF, amber behaves comparably to ochre, demonstrating little leakage and efficient termination. In general, expression levels attained under induction by 2.5 nM AcF are just slightly lower than those attained with sense codons, demonstrating high readthrough efficiency. Although translational coupling generally results in improved translation, it can have an adverse effect when translation is already high in non-coupling conditions (e.g. construct pGC4788), presumably through competition with regular initiation.

(D) Bulk high-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of individual flow-cytometry data versus FACS-Seq raw data under coupling conditions. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310). The red line is a linear regression fit, excluding outliers (grey data points). Although the benchtop flow cytometer used for individual measurements is less sensitive than the more sophisticated FACS machines used for the high-throughput experiments, we find good agreement between the two types of data (r=0.90).

(E) High-throughput measurements of protein production are reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates. Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density. Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Material and Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is generally good, although the first replicate shows a somewhat inconsistent signal (r=0.87 on average; r=0.91 excluding replicate #1; individual correlation coefficients as shown). We retained that replicate for the calculation of PC because it nonetheless provided valuable information.

(F) Lower design error under coupling. Shown are the cumulative distributions of the coefficients of variation in PC amongst experimental replicates (red) and amongst the 3 close design replicates within each series (sequences 1-4 nts apart with identical factorial properties; blue). Unlike the situation under non-coupling conditions (Figure S3C), the design error is hardly distinguishable from the experimental error under coupling. This likely arises from the combined effect of lesser experimental reproducibility and lower variance in measured fluorescence across the library. However, a buffering effect of translational coupling on the sources of design error (e.g. secondary structure mispredictions) cannot be ruled out.

(G) Series-wise decomposition of explainable variance. Same plot as Figure 4C but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on discretized scores. Series order and color scheme are maintained for comparison.

(H) Multiple linear regressions and ANOVA yield comparable results under coupling. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs. Left: total explanatory powers; Right: effect sizes for each design property and their second-order interactions (log scale). The largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.




Figure S5 – Increased codon adaptation directs improved protein production when translation initiation is not limiting

(A) Protein production steadily increases with CAI in the elongation-limited regime defined under coupling. Shown is a scatter plot of PC against CAI colored by PNC, excluding the lowest decile of protein producers, which remains limited by initiation (Figure 5C). The transparent dark line is a linear regression (regression coefficient as shown). Grey lines mark the quartiles of protein abundance for every percentile of CAI.

(B) Protein production slightly increases with CAI amongst the top protein expressers under non-coupling conditions. Shown is a scatter plot of PNC against CAI colored by PC for the top quintile of protein producers, which defines the apparent elongation-limited regime under non-coupling conditions (Figure S3G).




Figure S7 – Examples of comparable structure profiles leading to different protein productions

Minimum free energies of predicted secondary structures are plotted as a function of window position for windows of different lengths, as shown. Constructs with similar structure profiles (same row) can be found in distinct regions of the protein production space, as indicated by the red gates in a scatter plot of PC against PNC (top). Conversely, very different structure profiles can yield the same production phenotypes (same column). These profiles also highlight that different window lengths can yield highly dissimilar profiles for a given construct (e.g. 50 and 70 nts-long windows on construct 71_33111112_1, middle-left plot).




MATERIAL AND METHODS

Genomic analysis

We used the genome of Escherichia coli str. K-12 substr. MG1655 (GI:48994873) as a reference. Annotated CDSs and their preceding 99 nucleotides were extracted as a multifasta file using custom Biopython scripts. Only CDSs longer than 99 nucleotides were considered. This file was used as input for the batch sequence analysis mode of D-Tailor, for which we had developed module extensions to explore and calculate the sequence properties of interest (Guimaraes et al., 2014). Ten shuffled versions of each wild-type gene, constrained to maintain UTR nucleotide composition and amino acid sequence, were generated. These were used as a null reference set for comparison purposes, when appropriate. We used functional genomic data from Taniguchi et al. (2010) for preliminary correlation analyses between calculated sequence properties and mRNA and protein abundances.

Sequence design

We used the design mode of D-Tailor (Guimaraes et al., 2014) to derive factorial series from input seed sequences. Initially, 56 seed sequences of 96 nts were randomly generated and further evolved using a simple Monte-Carlo simulation to maximize the pairwise Hamming distance between sequences within the set. Seed sequences were pre- and post-pended with the sequence context of our expression system. Mutational processes used to generate variant sequences were first constrained to favor synonymous mutations. This constraint was relaxed to ease completion of the full-factorial design once the rate at which new combinations of properties were discovered became too low and at least two thirds of a series was completed. For some seeds, the discovery rate stalled rapidly during the design process, so that deriving the intended full factorial set was impractical. In such cases, a new random seed was generated, evolved for maximal distance to the other current seeds and submitted to D-Tailor.
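The seed-generation step can be illustrated with a small sketch. It is not the authors' implementation: it assumes a greedy (zero-temperature) acceptance rule and uses the summed pairwise Hamming distance as the objective, neither of which is specified in the text. All function names are hypothetical.

```python
import random
from itertools import combinations

BASES = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_pairwise_distance(seqs):
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def random_seeds(n_seeds, length, rng):
    return ["".join(rng.choice(BASES) for _ in range(length))
            for _ in range(n_seeds)]

def evolve(seeds, n_steps, rng):
    """Propose single-base changes; keep those that do not reduce the
    summed pairwise Hamming distance within the set."""
    seqs = [list(s) for s in seeds]
    for _ in range(n_steps):
        i = rng.randrange(len(seqs))
        pos = rng.randrange(len(seqs[i]))
        old = seqs[i][pos]
        new = rng.choice([b for b in BASES if b != old])
        others = [s[pos] for j, s in enumerate(seqs) if j != i]
        # Only distances involving seed i at this position can change.
        delta = sum(new != b for b in others) - sum(old != b for b in others)
        if delta >= 0:
            seqs[i][pos] = new
    return ["".join(s) for s in seqs]

# 56 seeds of 96 nts, as in the text; step count and RNG seeds are arbitrary.
initial = random_seeds(56, 96, random.Random(1))
evolved = evolve(initial, 5000, random.Random(2))
```

Because only one position of one seed changes per step, the change in the objective is computed from that single column, which avoids recomputing all 1,540 pairwise distances at every step.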
For a given seed, the design process stopped once all combinations of properties (1458) were obtained at least once. Because multiple sequence solutions were found for some combinations of properties, we used a simple heuristic procedure to select a full-factorial sequence set that minimizes the average pairwise sequence distance within the series. We then derived two variants for each of the sequences in a given series by introducing 1-4 mutations at random, while maintaining the original combination of sequence properties. Each factorial series but one thus contains 4374 sequences. Due to space limitations on the synthesis chip, we did not derive a third variant for 944 sequences of seed #136, which thus comprises only 3430 sequences. Altogether, the final design library comprises exactly 244,000 sequences. Throughout the design process, we rejected sequences that contained useful restriction sites, potential promoters, terminators or internal ribosome binding sites. The D-Tailor modules implemented to analyze and evolve the sequence properties of interest are available upon request.

DNA synthesis

The 96 nts-long design sequences were prefixed and suffixed by 29 and 26 nts-long sequences, respectively. These flanking sequences provide a constant sequence context, including restriction sites, and the 24-nt priming sites necessary for PCR amplification after synthesis. We used 3 orthogonal pairs of priming sites so that 3 subsets of the library could be independently amplified. Two subsets comprising 15,811 and 15,831 sequences, respectively, were defined so that each 20-mer in the synthetic sequences could be uniquely mapped to a particular construct within each subset. This layout was originally intended to enable ribosome profiling of these subsets, although we have not pursued this application in this work. The third subset contains the remaining sequences (212,358).
After these additions, the 151 nts-long sequences have the following generic form:

FFFFFFFFFFFFFFFFFGGTACCATAATG[N×96]GGTGGATCCRRRRRRRRRRRRRRRRR

where [N×96] stands for the 96 nts of the design sequence, and F and R correspond to the variable parts of the forward and reverse priming sequences, respectively. The restriction sites used for cloning of the synthetic sequences are KpnI (GGTACC) and BamHI (GGATCC). The PCR primers target the 24-nt sequences formed by the F or R stretch together with the 7 adjacent constant nts (GGTACCA and TGGATCC, respectively). Variable priming sequences are as follows:

Group      Forward                     Reverse
Group #1   atcgatgtaccgtgatcggtacca    tggatccctgatgatgtagacagg
Group #2   catcgaagtcgctctaaggtacca    tggatccgagagctgactatactg
Group #3   cgtcctacttatggaagggtacca    tggatcctgctacgatgtctgtca
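As a consistency check on the layout described above, the following sketch assembles a full-length oligonucleotide from a design sequence and one of the three priming groups, and verifies the library sizes quoted in the text. The helper name `build_oligo` is hypothetical; the constant stretches (TAATG after the forward priming site, GG before the reverse priming site) are read off the generic form.

```python
# Priming sites as listed for each group (top-strand orientation).
FORWARD = {
    1: "atcgatgtaccgtgatcggtacca",
    2: "catcgaagtcgctctaaggtacca",
    3: "cgtcctacttatggaagggtacca",
}
REVERSE = {
    1: "tggatccctgatgatgtagacagg",
    2: "tggatccgagagctgactatactg",
    3: "tggatcctgctacgatgtctgtca",
}

def build_oligo(design, group):
    """Return the 151-nt synthetic sequence for a 96-nt design sequence."""
    if len(design) != 96:
        raise ValueError("design sequence must be 96 nts long")
    prefix = FORWARD[group].upper() + "TAATG"   # 24 + 5 = 29 nts
    suffix = "GG" + REVERSE[group].upper()      # 2 + 24 = 26 nts
    return prefix + design.upper() + suffix

oligo = build_oligo("A" * 96, group=1)

# Library bookkeeping, using only numbers stated in the text.
per_series = 1458 * 3                  # each combination plus two variants
library_size = 55 * per_series + 3430  # 55 full series plus seed #136
subset_total = 15811 + 15831 + 212358
```

Here `oligo` is 151 nts long and contains both the KpnI and BamHI sites, while `library_size` and `subset_total` both evaluate to 244,000.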




Sequences were named with unique identifiers, written to a multifasta file and sent to Agilent laboratories for synthesis. The 244,000 sequences were synthesized as single-stranded DNA oligonucleotides on a single array using the OLS technology (Kosuri and Church, 2014). The procedure generated a total of ~90 ng of full-length synthetic DNA, which was sent to us in a single tube (~3 ng/µL).

Plasmid construction

Accessory plasmid. We modified pEVOLPROBE-aaRS (a gift from Chang Liu, University of California, Irvine), which comprises a pBBR1 origin of replication and a kanamycin resistance cassette derived from pBROBE (Miller et al.) and an orthogonal tRNA/aminoacyl-tRNA synthetase system derived from pEVOL (Young et al., 2010). pEVOLPROBE-aaRS allows the expression of an optimized tRNACUA and two copies of a cognate aminoacyl-tRNA synthetase (aaRS) gene evolved to selectively discriminate the unnatural amino acid p-acetylphenylalanine (pAcF). One of the aaRS copies is controlled by a pBAD promoter, which is repressed by the product of the araC gene, also encoded on the plasmid. Upon addition of 0.1% arabinose (Sigma) and 10 mM pAcF (Synchem), tRNACUA are effectively charged with pAcF and outcompete RF1 for the decoding of the amber stop codon (UAG), resulting in the incorporation of the unnatural amino acid into the polypeptide chain and continuation of elongation (Young et al., 2010). We amplified a chloramphenicol resistance cassette from pBbE1c-RFP (Lee et al., 2011) and cloned it in place of the original kanamycin resistance cassette (NotI/XbaI) to yield pGC4510. We amplified the tev protease gene from pRK603 (Kapust and Waugh, 2000) and cloned it into pBbA2k (EcoRI/BamHI) (Lee et al., 2011), yielding pGC2109. A fragment containing anti-oriented tev and tetR genes driven by a bidirectional pTet promoter was amplified from pGC2109 and cloned into pGC4510 (NotI) to yield the accessory plasmid pGC4593. Addition of 200 nM anhydrotetracycline allows the expression of the TEV protease, which recognizes and cuts a specific polypeptide motif (Kapust et al., 2002).



Reporter plasmid. The reporter plasmid is build on a pFAB512 backbone (Cambray et al., 2013) which comprises a standard sfGFP expression cassette, a p15a origin of replication and a kanamycin resistance cassette. To increase the sensitivity of fluorescence measurements at weak protein production, p15a was replaced with a higher copy number origin of replication (ColE1, ~30 plasmids per cell) from plasmid pBbE1c-RFP (Lee et al., 2011) (AvrII/EcoRI). To ensure strong repression of the reporter’s transcription, we inserted the lacI gene amplified from E. coli BW25113 into the plasmid (BglII/XbaI) and modified its promoter in the process to reconstruct the LacIQ variant (Glascock and Weickert, 1998). We amplified mRFP1 followed by the rnaT1 terminator from pFAB809 (Cambray et al., 2013) and cloned it as a transcriptional fusion to the LacIQ gene (EcoRI/XbaI). We then randomly mutated the RBS associated with mRFP1 to tune its expression to a low but clearly detectable level. We used this fluorescent reporter to control for extrinsic noise during flow cytometry applications (Elowitz et al., 2002; Kosuri et al., 2013; Liang et al., 2012). This whole procedure yielded pGC4724. We extensively engineered the reporter system using pFAB512 as a template. First, we replaced the sfgfp original promoter by a strong synthetic promoter tightly repressed by a lac operator (Mutalik et al., 2013). To that end, we introduced two BsaI site by inverse PCR (iPCR) of the whole backbone (excluding the native promoter) and cloned the repressible promoter in the form of annealed oligonucleotides with matching overhangs. Second, we introduced a linker sequence at the very beginning of the reporter by another iPCR. That sequence comprises: i) a unique BamHI restriction site; ii) a flexible 3xGGS linker; iii) a 6xHis tag; and iv) a specific cleavage motif for the TeV protease (Kapust et al., 2002). 
Third, we used iPCR to insert a 57-nucleotide-long leader sequence, the stop codon of which overlaps with the start codon of the reporter and further introduces a perfect SD sequence in the reporter translation initiation region (Mutalik et al., 2013), as well as a unique KpnI restriction site. Fourth, we used site-directed mutagenesis to introduce amber stop codons at various locations in the leader sequence. Based on experimental results, we selected the sequence with TAG at the fifth codon position as the one with the most efficient coupling properties (lower readthrough in non-coupling conditions and higher protein production in coupling conditions, data not shown). Fifth, we modified the codon usage of sfGFP. The goal of these modifications was to introduce a putative translation bottleneck in the last third of the gene’s sequence, as estimated using a tAI-based profile (Tuller et al., 2010). The strength of that terminal bottleneck was set to a moderately high value so that stronger and weaker bottlenecks could be encoded in the variable region (Figure S1IJ). Finally, we amplified a ccdB expression cassette from the Gateway plasmid pDONR221 (Invitrogen) and cloned it between KpnI and BamHI to yield pGC4742, a counter-selectable acceptor plasmid for the reporter system. To obtain the final reporter plasmid pGC4750, we sub-cloned the engineered reporter system from pGC4742 into pGC4691 (BglII/AvrII). pGC4750 is propagated in the E. coli strain ccdB Survival 2 (Invitrogen).




Library construction

Aliquots of about 1.5 ng of the library were amplified separately with each of the three pairs of primers using Phusion DNA polymerase (NEB) (15 cycles, 4 tubes of 100 µL each for a given reaction). Amplicons were separated by electrophoresis in a 4% agarose gel (NuSieve GTG, Lonza). We excised the bands at the expected size (151 bp) as well as larger, fuzzier bands located just below 200 bp, recovered the DNA on a column (Zymoclean, Zymo) and eluted in water. Cleaned PCR products were digested overnight with KpnI-HF and BamHI-HF (NEB). Upon digestion, DNA from the low and high bands had the same size. The reporter plasmid pGC4750 was miniprepped (Qiagen) and digested overnight with KpnI-HF and BamHI-HF (NEB). The cut vector was resolved from the ccdB insert by electrophoresis in a 0.8% agarose gel and cleaned as above. Sanger sequencing of clones obtained from initial cloning tests showed that inserts from both isolated bands were equally good. For each subset of the library, we thus pooled the two extracts. For the library cloning, about 50 nmol of digested inserts and vector were mixed at a 1:1 molar ratio, as quantified using the Qubit dsDNA HS fluorometric assay (Life Technologies), and ligated overnight with T4 DNA ligase (NEB). Ligation products were dialyzed and electroporated into E. coli MDS42 recA (Scarab Genomics) previously transformed with the accessory plasmid pGC4593. Upon electroporation, cells were recovered at 37°C for 1 h without shaking, plated on large LB agar plates supplemented with kanamycin and chloramphenicol for plasmid selection and grown overnight at 37°C. Additionally, serial dilutions of the electroporations were plated on small agar plates to estimate transformation efficiency. Transformants were scraped, resuspended in rich MOPS medium supplemented with kanamycin and chloramphenicol and homogenized by shaking at 900 rpm (37°C) for 1 h. 50 µL aliquots were then mixed with 15% glycerol and frozen at -80°C.
We estimated from plating of dilution series that the three library subsets contained 1.25 × 10^7, 0.51 × 10^7 and 1.21 × 10^7 individual clones, which represent 791-, 322- and 57-fold coverage of the respective library subsets (15,811, 15,831 and 212,358 sequences, respectively). Accordingly, we mixed the three homogenized cultures at 10:10:134 volumetric ratios and saved aliquots of this final library. About a third of the sequence reads obtained upon high-throughput sequencing of the final library contain mutations, with a majority of small deletions typical of synthetic DNA (Kosuri and Church, 2014). Such mutants are excluded from all analyses presented in this work. The frequency of individual clones varied over a 30-fold range within the library. These variations most likely reflect various construction biases rather than deleterious effects of the cloned sequences, because transcription of the reporter had never been induced at this stage.

Measurement of protein production

Growth conditions. Cells were grown overnight in 5 mL rich MOPS medium (Teknova) supplemented with kanamycin and chloramphenicol (for plasmid maintenance), IPTG (for induction of reporter transcription) and aTc (for induction of tev transcription). In coupling conditions, the growth medium was further complemented with arabinose (for induction of unnatural acetyl synthase transcription) and the unnatural amino acid AcF (2.5 nM). Since addition of AcF results in acidification of the medium, the pH was adjusted to its initial value of 7.2 by addition of NaOH.

Low throughput measurement of a reference panel. During the construction of the library, individual colonies were picked at random, cultured independently in 96-well plates and Sanger sequenced. This procedure yielded a small sub-panel of the designed library comprising 310 clearly identified strains.
Upon growth as described above, single-cell green and red fluorescence intensities were measured using an automated Guava EasyCyte flow cytometer (EMD Millipore, Hayward, CA, USA) and processed as described previously (Cambray et al., 2013). All strains were measured at least in triplicate. Although derived from a lower precision instrument, these data were used to benchmark the results from the high-throughput procedure described below (Figure S3A and S4D).

Sorting of the population into fluorescence bins. In both coupling and non-coupling conditions, replicates #1 and #2 were sorted with a BD Influx and replicates #3 and #4 were sorted with a BD FACSAria II. In both cases, events were tightly gated around the mean RFP fluorescence to control for extrinsic noise in protein production (the RFP gene is located on the same plasmid as the reporter gene and is also controlled by an IPTG-inducible promoter). The fluorescence range of the library in the green channel was then divided into 16 equally sized contiguous bins in log space. The population was sorted through each of the 16 resulting gates into different tubes (4 sorting rounds of 4 tubes each). To ensure that the number of sorted cells in each bin was proportional to their phenotypic density in the initial population, the sort rate was kept constant throughout the procedure and each of the bins was sorted for an equal amount of time. Given the size of the library, the actual sort rates were a limiting factor. Collection times varied between 12 hours (replicates #1 and #2) and 24 hours (replicates #3 and #4) to permit collection of enough cells (at least 100-fold the size of the library). To avoid phenotypic evolution of the sample between collection times, several cultures of the same replicate were seeded with delays matching their collection times.
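The division of the green fluorescence range into 16 contiguous bins of equal width in log space can be illustrated with a minimal sketch. The function names and the fluorescence range used here (1 to 10,000 arbitrary units) are assumptions chosen for illustration; they are not the gate settings used on the sorters.

```python
import bisect
import math

def log_bin_edges(f_min, f_max, n_bins=16):
    """Return n_bins + 1 edges that are equally spaced in log10 space."""
    log_min, log_max = math.log10(f_min), math.log10(f_max)
    step = (log_max - log_min) / n_bins
    return [10 ** (log_min + k * step) for k in range(n_bins + 1)]

def bin_index(value, edges):
    """Return the 1-based bin containing value, or None if it lies outside the range."""
    if value < edges[0] or value > edges[-1]:
        return None
    # bisect_right counts the edges <= value; clamp so the top edge falls in the last bin
    return min(bisect.bisect_right(edges, value), len(edges) - 1)

# Hypothetical fluorescence range (arbitrary units), for illustration only
edges = log_bin_edges(1.0, 10_000.0)
```

With such edges, every bin spans the same fold-change in fluorescence, so a mean computed over bin indices is a mean in log space.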
To amplify the sorted populations, we added an equal volume of LB medium to the sorted cells in PBS and complemented it with appropriate concentrations of kanamycin and chloramphenicol for plasmid maintenance. After the population reached saturation, we miniprepped the plasmid DNA from a 2 mL aliquot (Qiagen) and quantified concentrations using a NanoDrop.

Preparation of sequencing libraries. For each bin, we amplified 5 ng of extracted plasmids by PCR for 15 cycles. We used long oligonucleotides wherein the Illumina sequencing adapter, followed by defined 8-nt-long barcodes, precedes the priming region. Random spacers of varying sizes were introduced between the standard priming site of the adapter and the barcodes to introduce complexity at the beginning of the sequence and improve the clustering step during the sequencing procedure. The primer sequences are detailed below. Combinations of two barcodes introduced by the forward and reverse primers are used to uniquely identify sample origin upon multiplexing. The amplicons were cleaned and size selected using magnetic SPRI beads (Agencourt). The quantity of purified DNA was measured using Qubit (Life Technologies). We then mixed DNA samples originating from each of the binned populations in amounts proportional to their expected diversities (i.e. the frequencies of the bins in the whole population, as estimated from the sorted cell numbers). Compatible libraries, i.e. amplified with different barcode combinations, were pooled in pairs in equal proportions (replicates #1 with #3 and #2 with #4) and sequenced on a HiSeq 2500 (Illumina) using the rapid mode (2x150 cycles).

Primer name | Primer sequence
sp0_bc1_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTAGTAATACCATGCACATAAGGAGGTACCATAATG
sp1_bc2_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTBCAAGATATCATGCACATAAGGAGGTACCATAATG
sp2_bc3_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNTGTTTGGTCATGCACATAAGGAGGTACCATAATG
sp3_bc4_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNMTTCCGACCCATGCACATAAGGAGGTACCATAATG
sp4_bc5_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNAGATAGTGCATGCACATAAGGAGGTACCATAATG
sp5_bc6_fw | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNMCTCGCCAGCATGCACATAAGGAGGTACCATAATG
sp0_bc1_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTATCACGAGACCCGCCCGATCCACCGGATCCACC
sp1_bc2_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTBCGATGTTCACCCGCCCGATCCACCGGATCCACC
sp2_bc3_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTNNTTAGGCGAACCCGCCCGATCCACCGGATCCACC
sp3_bc4_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTNNMTGACCAATACCCGCCCGATCCACCGGATCCACC
sp4_bc5_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTNNNNACAGTGGTACCCGCCCGATCCACCGGATCCACC
sp5_bc6_rw | CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTNNNNKCGCTATGTACCCGCCCGATCCACCGGATCCACC

Primers for targeted multiplex sequencing (Illumina PE PCR Primer 1.0 and 2.0 in pink and orange, respectively, with sequencing primers for reads 1 and 2 underlined; random spacers in red; custom barcodes in blue; priming region for targeted amplification of the designed sequences in grey).

Processing of FACS-Seq data. Sequencing reads were demultiplexed using custom Python scripts. Sequences located in the design region were then mapped to the designed sequences using BWA (Li and Durbin, 2010). Data were compiled to produce a summary dataset that counts the number of perfectly matching reads observed in each bin for each designed sequence.
Reads with mutations were saved in a separate file and excluded from further analyses. For each replicate experiment, the read frequencies observed in each bin were adjusted to match those expected based on the cell densities observed during sorting:

$$f_{s,i,r} = \frac{R_{s,i}}{\sum_{s'} R_{s',i}} \times S_{i,r} \qquad \text{(eq. 1)}$$

where, for replicate r, $f_{s,i,r}$ is the corrected frequency for sequence s in bin i; $R_{s,i}$ is the read count for sequence s in bin i, so that the denominator is the total read count in bin i across all sequences; and $S_{i,r}$ is the observed frequency of cells sorted in bin i during the FACS procedure for that replicate. This correction is necessary to compensate for unavoidable DNA quantification and loading errors when multiplexed samples are pooled at specific ratios for sequencing. From these reconstructed fluorescence profiles, we calculated a mean fluorescence as the weighted mean of the four adjacent bins with the maximal summed frequency:



$$\mu_{s,r,\mathrm{log}} = \frac{\sum_{i=j_{s,r}}^{j_{s,r}+3} \left(f_{s,i,r} \times i\right)}{\sum_{i=j_{s,r}}^{j_{s,r}+3} f_{s,i,r}} \qquad \text{(eq. 2)}$$

where $j_{s,r} \in [1,12]$ is the index of the first of the four adjacent bins selected for each sequence in each replicate. This simple filtering procedure permitted us to partly resolve bimodal profiles, wherein highly fluorescent strains often show a weak signal at low fluorescence. This pattern probably results from occasional mutations outside of the sequenced region, which tend to be selected amongst high reporter producers. To control for systematic differences between replicates, we rescaled each replicate linearly to minimize the between-replicate error. Again, we observed that high reporter-producing strains occasionally show much reduced performance in one replicate. We excluded such low values on the basis that they represent mutants that took over the original strain in a given replicate. We then quantile-normalized the data (Bolstad et al., 2003) to obtain corrected values of the mean in log space ($\mu^{*}_{s,r,\mathrm{log}}$). We applied an exponential transformation to map the cleaned data onto a linear space:

$$\mu_{s,r,\mathrm{lin}} = e^{\mu^{*}_{s,r,\mathrm{log}} / C} \qquad \text{(eq. 3)}$$

where C is a constant chosen to maximize the kurtosis of the observed distribution, using the nlm function in R. Final data were rescaled between 1 and 100 to obtain the protein production metric used in all analyses:

$$P_{s,r} = \frac{\mu_{s,r,\mathrm{lin}} - \min(\mu_{s,r,\mathrm{lin}})}{\max(\mu_{s,r,\mathrm{lin}}) - \min(\mu_{s,r,\mathrm{lin}})} \times 99 + 1 \qquad \text{(eq. 4)}$$
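The arithmetic of eq. 1, eq. 2 and eq. 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scripts used for the paper: it assumes that eq. 1 scales each sequence's share of the reads in a bin by the fraction of cells sorted into that bin, the function names are hypothetical, and the toy inputs are invented. The quantile normalization and the exponential transformation of eq. 3 are left out.

```python
def corrected_frequencies(reads, sorted_fraction):
    """eq. 1: reads[s][i] is the read count of sequence s in bin i;
    sorted_fraction[i] is the fraction of cells sorted into bin i."""
    n_bins = len(sorted_fraction)
    bin_totals = [sum(row[i] for row in reads) for i in range(n_bins)]
    return [
        [row[i] / bin_totals[i] * sorted_fraction[i] if bin_totals[i] else 0.0
         for i in range(n_bins)]
        for row in reads
    ]

def windowed_mean_bin(freq, width=4):
    """eq. 2: weighted mean bin index (1-based) over the `width` adjacent
    bins with the largest summed frequency; None for an empty profile."""
    starts = range(len(freq) - width + 1)
    j = max(starts, key=lambda k: sum(freq[k:k + width]))  # first best window
    window = freq[j:j + width]
    total = sum(window)
    if total == 0:
        return None
    return sum(f * (j + offset + 1) for offset, f in enumerate(window)) / total

def rescale_1_100(values):
    """eq. 4: linear rescaling of a list of values to the range [1, 100]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * 99 + 1 for v in values]

# Toy inputs, invented for illustration only
toy_freqs = corrected_frequencies([[10, 0], [30, 5]], [0.5, 0.5])
toy_profile = [0] * 16
toy_profile[4:8] = [1, 2, 2, 1]
toy_mean_bin = windowed_mean_bin(toy_profile)
toy_scaled = rescale_1_100([2, 4, 6])
```

Restricting the mean to the best four-bin window is what discards the weak secondary mode of bimodal profiles discussed in the text.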

In all graphs, data points show the mean protein production across the four replicates. Pairwise comparisons of the replicates after processing are shown in Figure S3B.

Data processing, management and analysis

All data were consolidated in a single dataset and analyzed using R. An archive containing the consolidated dataset including data from the companion paper, the scripts used to generate the figures, as well as other pieces of data used in the figures is provided as supplementary material.

Description of the consolidated dataset.

KEY | NAME | DESCRIPTION
id | | Unique id for a sequence (concatenation of seed, combi and rep).
seed | | Id of each factorial series (full factorial design * 3 close sequences).
combi | | Combination of property levels.
rep | | 3 sequences close to each other (1-3 mutations) representing the same combination of property levels.
gs.sequence | | The designed sequence (96 bp).
ds.cdsCAI | | Level of CAI (Codon Adaptation Index, geometric mean of CAI over the 32 designed codons; CAI is scaled between 0 and 1 and quantifies how closely the codon usage of a sequence resembles that of very highly expressed genes in natural genomes).
ds.utrCdsStructureMFE | | Minimum Free Energy (MFE) level of the mRNA structure at position [-30;+30] of the fused GFP reporter gene with respect to the start codon. The [-30;3] segment is constant, thus the structure is dictated by variation in [1;27] of the designed sequence. The lower the MFE, the stronger the structure.
ds.fivepCdsStructureMFE | | Minimum Free Energy (MFE) level of the mRNA structure at position [1;+60] of the fused GFP reporter gene with respect to the start codon. The [1;3] segment is constant (ATG start codon). The structure is almost entirely dictated by variation in the designed sequence. This sequence segment overlaps by 30 bases with ds.utrCdsStructureMFE and with ds.threepCdsStructureMFE. The lower the MFE, the stronger the structure.
ds.threepCdsStructureMFE | | Minimum Free Energy (MFE) level of the mRNA structure at position [+31;+90] of the fused GFP reporter gene with respect to the start codon. This sequence segment overlaps by 30 bases with ds.fivepCdsStructureMFE. The lower the MFE, the stronger the structure.
ds.cdsBottleneckPosition | | Level of the position of the region with the lowest tAI profile. The tAI profile attempts to quantify the local rate of translation by relating an approximate measure of tRNA abundances to different rates of codon translation. There are two defined levels. Level 1: the bottleneck is within the designed sequence, at varying positions. Level 2: the bottleneck is fixed at the end of the constant part of the reporter (codon position 251).
ds.cdsBottleneckRelativeStrength | | Level of strength of the region with the lowest tAI profile. The tAI profile attempts to quantify the local rate of translation by relating an approximate measure of tRNA abundances to different rates of codon translation. This property is nested within ds.cdsBottleneckPosition: it is only possible to vary the strength of the bottleneck within the designed sequence. When the bottleneck is located at the end of the reporter (codon position 251), the strength is fixed to a constant value.
ds.cdsNucleotideContentAT | | Level of AT content of the designed sequence at positions [+1;+18]. The AT content is simply the fraction of A and T over the 18 bases of the defined region. 2 levels.
ds.cdsHydropathyIndex | | Level of hydropathy index of the polypeptide encoded by the designed sequence at positions [+28;+60]. The average hydropathy quantifies the behavior of a polypeptide with respect to an aqueous solvent. Higher hydropathy indicates hydrophobicity (not miscible with water).
gs.cdsCAI | CAI | Value of CAI (Codon Adaptation Index, geometric mean of CAI over the 32 designed codons; CAI is scaled between 0 and 1 and quantifies how closely the codon usage of a sequence resembles that of very highly expressed genes in natural genomes).
gs.utrCdsStructureMFE | STR-30:+30 | Minimum Free Energy (MFE) value of the mRNA structure at position [-30;+30] of the fused GFP reporter gene with respect to the start codon. The [-30;3] segment is constant, thus the structure is dictated by variation in [1;27] of the designed sequence. The lower the MFE, the stronger the structure.
gs.fivepCdsStructureMFE | STR+01:+60 | Minimum Free Energy (MFE) value of the mRNA structure at position [1;+60] of the fused GFP reporter gene with respect to the start codon. The [1;3] segment is constant (ATG start codon). The structure is almost entirely dictated by variation in the designed sequence. This sequence segment overlaps by 30 bases with ds.utrCdsStructureMFE and with ds.threepCdsStructureMFE. The lower the MFE, the stronger the structure.
gs.threepCdsStructureMFE | STR+31:+90 | Minimum Free Energy (MFE) value of the mRNA structure at position [+31;+90] of the fused GFP reporter gene with respect to the start codon. This sequence segment overlaps by 30 bases with ds.fivepCdsStructureMFE. The lower the MFE, the stronger the structure.
gs.cdsBottleneckPosition | BtlP | Actual position of the region with the lowest tAI profile. The tAI profile attempts to quantify the local rate of translation by relating an approximate measure of tRNA abundances to different rates of codon translation. The position either varies within the designed sequence or is fixed at the end of the constant part of the reporter (codon position 251).
gs.cdsBottleneckRelativeStrength | BtlS | Strength of the region with the lowest tAI profile. The tAI profile attempts to quantify the local rate of translation by relating an approximate measure of tRNA abundances to different rates of codon translation. This property is nested within ds.cdsBottleneckPosition: it is only possible to vary the strength of the bottleneck within the designed sequence. When the bottleneck is located at the end of the reporter (codon position 251), the strength is fixed to a constant value.
gs.cdsNucleotideContentAT | %AT | AT content of the designed sequence at positions [+1;+18]. The AT content is simply the fraction of A and T over the 18 bases of the defined region. As a result, the metric is coarse.
gs.cdsHydropathyIndex | MHI | Average hydropathy index of the polypeptide encoded by the designed sequence at positions [+28;+60]. The average hydropathy quantifies the behavior of a polypeptide with respect to an aqueous solvent. Higher hydropathy indicates hydrophobicity (not miscible with water).
group | | Priming group. The synthesized sequences are flanked by different priming regions, defining three sublibraries that are amplifiable independently. Three groups comprising 15811, 15831 and 212358 sequences are thus defined.
log.prot.ucb3 | | Fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #1 (UCB experiment).
log.prot.ucb4 | | Fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #2 (UCB experiment).
log.prot.ucsf3 | | Fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #3 (UCSF experiment).
log.prot.ucsf4 | | Fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #4 (UCSF experiment).
clean.lin.prot.ucb3 | | Cleaned fluorescence levels after growth in regular rich MOPS medium. Transformed from log.prot.ucb3 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.ucb4 | | Cleaned fluorescence levels after growth in regular rich MOPS medium. Transformed from log.prot.ucb4 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.ucsf3 | | Cleaned fluorescence levels after growth in regular rich MOPS medium. Transformed from log.prot.ucsf3 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.ucsf4 | | Cleaned fluorescence levels after growth in regular rich MOPS medium. Transformed from log.prot.ucsf4 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.mean | PNC | Average fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium across available cleaned replicates (clean.lin.prot.*).
clean.lin.prot.var | | Variance of the fluorescence levels measured by FACS-Seq after growth in regular rich MOPS medium across available cleaned replicates (clean.lin.prot.*).
log.prot.acf.ucb5 | | Fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with the unnatural amino acid AcF. The presence of AcF induces translational coupling, which alleviates the effect of utrCdsStructureMFE on translation initiation. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #1 (UCB experiment).
log.prot.acf.ucb7 | | Fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with the unnatural amino acid AcF. The presence of AcF induces translational coupling, which alleviates the effect of utrCdsStructureMFE on translation initiation. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #2 (UCB experiment).
log.prot.acf.ucsf5 | | Fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with the unnatural amino acid AcF. The presence of AcF induces translational coupling, which alleviates the effect of utrCdsStructureMFE on translation initiation. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #3 (UCSF experiment).
log.prot.acf.ucsf7 | | Fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with the unnatural amino acid AcF. The presence of AcF induces translational coupling, which alleviates the effect of utrCdsStructureMFE on translation initiation. Weighted average of read distribution across the 4 consecutive logarithmic fluorescence bins with highest read count. Replicate #4 (UCSF experiment).
clean.lin.prot.acf.ucb5 | | Cleaned fluorescence levels after growth in rich MOPS medium supplemented with AcF. Transformed from log.prot.acf.ucb5 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.acf.ucb7 | | Cleaned fluorescence levels after growth in rich MOPS medium supplemented with AcF. Transformed from log.prot.acf.ucb7 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.acf.ucsf5 | | Cleaned fluorescence levels after growth in rich MOPS medium supplemented with AcF. Transformed from log.prot.acf.ucsf5 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.acf.ucsf7 | | Cleaned fluorescence levels after growth in rich MOPS medium supplemented with AcF. Transformed from log.prot.acf.ucsf7 to linear scale and further scaled between 1 and 100. Value set to NA if very low compared to other replicates, as it probably reflects the effect of unobservable mutations.
clean.lin.prot.acf.mean | PC | Average fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with AcF across available cleaned replicates (clean.lin.prot.acf.*).
clean.lin.prot.acf.var | | Variance of the fluorescence levels measured by FACS-Seq after growth in rich MOPS medium supplemented with AcF across available cleaned replicates (clean.lin.prot.acf.*).
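As an illustration of how the summary columns relate to the per-replicate columns, the sketch below computes a mean and a variance across the available (non-NA) cleaned replicates of one sequence. It is a hedged example rather than the original R code: the function name is hypothetical, NA is represented by None, the replicate values are invented, and the use of the sample variance (n - 1 denominator) is an assumption since the text does not specify the estimator.

```python
import statistics

def summarize_replicates(values):
    """Return (mean, variance) over the replicates that are not None.

    The variance is the sample variance (an assumption) and is None
    when fewer than two replicates are available.
    """
    available = [v for v in values if v is not None]
    if not available:
        return None, None
    mean = statistics.fmean(available)
    var = statistics.variance(available) if len(available) > 1 else None
    return mean, var

# Invented values for one sequence; None stands for an NA replicate
row = {
    "clean.lin.prot.ucb3": 12.0,
    "clean.lin.prot.ucb4": 14.0,
    "clean.lin.prot.ucsf3": None,
    "clean.lin.prot.ucsf4": 16.0,
}
mean, var = summarize_replicates(row.values())
```

The same function would apply unchanged to the clean.lin.prot.acf.* columns measured under coupling conditions.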






REFERENCES

Allert, M., Cox, J.C., and Hellinga, H.W. (2010). Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
Cambray, G., Guimaraes, J.C., Mutalik, V.K., Lam, C., Mai, Q.A., Thimmaiah, T., Carothers, J.M., Arkin, A.P., and Endy, D. (2013). Measurement and modeling of intrinsic transcription terminators. Nucleic Acids Res.
Elowitz, M.B., Levine, A.J., Siggia, E.D., and Swain, P.S. (2002). Stochastic gene expression in a single cell. Science 297, 1183–1186.
Glascock, C.B., and Weickert, M.J. (1998). Using chromosomal lacIQ1 to control expression of genes on high-copy-number plasmids in Escherichia coli. Gene 223, 221–231.
Guimaraes, J.C., Rocha, M., Arkin, A.P., and Cambray, G. (2014). D-Tailor: automated analysis and design of DNA sequences. Bioinformatics.
Kapust, R.B., and Waugh, D.S. (2000). Controlled Intracellular Processing of Fusion Proteins by TEV Protease. Protein Expression and Purification 19, 312–318.
Kapust, R.B., Tözsér, J., Copeland, T.D., and Waugh, D.S. (2002). The P1' specificity of tobacco etch virus protease. Biochem. Biophys. Res. Commun. 294, 949–955.
Kosuri, S., and Church, G.M. (2014). Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507.
Kosuri, S., Goodman, D.B., Cambray, G., Mutalik, V.K., Gao, Y., Arkin, A.P., Endy, D., and Church, G.M. (2013). Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proceedings of the National Academy of Sciences 110, 14024–14029.
Lee, T.S., Krupa, R.A., Zhang, F., Hajimorad, M., Holtz, W.J., Prasad, N., Lee, S.K., and Keasling, J.D. (2011). BglBrick vectors and datasheets: A synthetic biology platform for gene expression. J Biol Eng 5, 12.
Li, H., and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595.
Liang, J.C., Chang, A.L., Kennedy, A.B., et al. (2012). A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. Nucleic Acids ….
Miller, W.G., Leveau, J.H.J., and Lindow, S.E. Improved gfp and inaZ Broad-Host-Range Promoter-Probe Vectors. Http://Dx.Doi.org/10.1094/MPMI.2000.13.11.1243.
Mutalik, V.K., Guimaraes, J.C., Cambray, G., Lam, C., Christoffersen, M.J., Mai, Q.-A., Tran, A.B., Paull, M., Keasling, J.D., Arkin, A.P., et al. (2013). Precise and reliable gene expression via standard transcription and translation initiation elements. Nat. Methods.
Sjöström, M., and Wold, S. (1985). A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino acids. J Mol Evol 22, 272–277.
Taniguchi, Y., Choi, P.J., Li, G.-W., Chen, H., Babu, M., Hearn, J., Emili, A., and Xie, X.S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538.
Tuller, T., Carmi, A., Vestsigian, K., Navon, S., Dorfan, Y., Zaborske, J., Pan, T., Dahan, O., Furman, I., and Pilpel, Y. (2010). An Evolutionarily Conserved Mechanism for Controlling the Efficiency of Protein Translation. Cell 141, 344–354.
Young, T.S., Ahmad, I., Yin, J.A., and Schultz, P.G. (2010). An enhanced system for unnatural amino acid mutagenesis in E. coli. J. Mol. Biol. 395, 361–374.
