De Novo Sequencing of Nonribosomal Peptides

Nuno Bandeira1, Julio Ng1, Dario Meluzzi1, Roger G. Linington2, Pieter Dorrestein1, Pavel A. Pevzner1

1 University of California, San Diego, USA
2 University of California, Santa Cruz, USA

Abstract. While nonribosomal peptides (NRPs) are of tremendous pharmacological importance, there is currently no technology capable of high-throughput sequencing of NRPs. Difficulties in sequencing NRPs slow down the progress in elucidating the non-ribosomal genetic code and negatively affect various screening programs aimed at the discovery of natural compounds of medical importance. We propose to employ multistage mass-spectrometry (MSn) for the data acquisition, followed by alignment-based heuristic algorithms for data analysis. Since mass spectrometry based analysis of NRPs is fast and inexpensive, this approach opens the possibility of high-throughput sequencing of many unknown NRPs accumulated in large screening programs.

Key words: Cyclic Peptides Sequencing De novo Algorithm

1 Introduction

The classical protein synthesis pathway (translation of template mRNA into proteins/peptides) is not the only mechanism for cells to assemble amino acids into peptides. The alternative Non Ribosomal Peptide Synthesis is performed by a large multi-enzyme complex (called Non Ribosomal Peptide Synthetase or NRPS) that represents both the biosynthetic machinery and the mRNA-free template for the biosynthesis of secondary metabolites (see [1–3] for recent reviews). NRPS gene clusters produce relatively short (up to 50 aa) nonribosomal peptides (NRPs) that are not directly inscribed in the genomic DNA and thus cannot be inferred with traditional DNA-based sequencing techniques. NRPs are of tremendous pharmacological importance since they were optimized during millions of years of evolution to play important roles in chemical defense and communication for producing organisms. Starting from penicillin, NRPs and other natural products have an unparalleled track record in pharmacology: 9 out of the top 20 best-selling drugs were either inspired by or derived from natural products. NRPs have some naturally evolved features that are applicable to the modulation of protein function in human systems, making them excellent lead compounds for the development of novel pharmaceutical agents. In particular, NRPs include antibiotics (penicillin, cephalosporin, vancomycin,
etc.), immunosuppressors (cyclosporin), antiviral agents (luzopeptin A), antitumor agents (bleomycin), toxins (thaxtomin), and many peptides with yet unknown functions.

When DNA sequencing is not available, biologists use either Edman degradation or tandem mass spectrometry (MS2) to sequence ribosomal peptides. However, neither of these approaches works for nonribosomal peptides since they differ from ribosomal peptides in many respects: (i) they often represent non-linear structures of amino acids, e.g., cyclic, tree-like, and branch-cyclic peptides, (ii) they often contain non-standard amino acids, increasing the number of possible building blocks from 20 to several hundred, (iii) they often have a non-standard backbone, and (iv) they are often modified. Each of these complications renders traditional Edman degradation and MS2 peptide sequencing approaches useless, leaving NMR as the only technology capable of analyzing NRPs [4–7]. The use of NMR for NRP sequencing is time-consuming, difficult to automate (there are currently no software tools for automatic interpretation of NRPs from NMR data), and error-prone (see [7, 8] for examples of errors in NMR sequencing). As a result, the extremely difficult total chemical synthesis remains the only reliable way to sequence and validate NRPs [9]. For example, Patrick Harran won the 2007 Hackerman Award in Chemical Research for his pioneering work on diazonamide A, a rare marine NRP. In the process of synthesizing diazonamide A, he discovered that the initial structure reported for the molecule was flawed [10].

An efficient and automated way to sequence NRPs will immediately benefit all searches for natural compounds of medical importance as well as studies of the still poorly understood mechanisms of nonribosomal peptide synthesis. Currently, the prediction of the chemical structure of unknown NRPs is not possible even if all genes involved in non-ribosomal synthesis are identified [11].
Furthermore, efficient NRP sequencing will aid biosynthetic engineering efforts to reprogram the NRP assembly lines in E. coli (most microbes producing NRPs are not amenable to cultivation). For example, the recent success in production of antitumor NRPs in E. coli required analysis of genetically engineered NRPs [12].

This paper introduces a combination of experimental and computational protocols that enable a mass-spectrometry based approach to sequencing NRPs. To the best of our knowledge, it is the first attempt to de novo sequence NRPs using mass spectrometry. Previous studies were limited to detection or resequencing of NRPs, i.e., sequencing of new NRP variants when the major NRP variant was known. In an early attempt to use mass spectrometry for NRP analysis, Barber et al., 1992 [13] analyzed variants of tyrothricin, an antimicrobial agent produced by Bacillus brevis. Tyrothricin is a mixture of different NRPs, and three major components of this mixture were previously identified. These three components were used as reference points to derive six other variant NRPs using mass spectrometry. In a recent study, Hitzeroth et al., 2005 [14] resequenced new variations of streptocidins on a MALDI-TOF-MS using information about previously sequenced streptocidins [15, 16]. However, as the authors of [14] remarked, this resequencing strategy is limited to peptides with pure amino acid
sequence but grows more difficult when modifications are present. In another application of mass spectrometry to NRP analysis, Redman et al., 2003 [17] developed an algorithm for identification of cyclic peptides from combinatorial libraries. This approach amounts to accurate scoring of all candidates from the predefined library and cannot be applied to de novo sequencing since the set of all possible peptides grows exponentially with the length of the peptide.

In the following sections, we employ multi-stage mass-spectrometry (MSn) for de novo sequencing of cyclic peptides. We first describe the NRP-Sequencing algorithm for reconstructing cyclic peptides from a single MS3 spectrum. We then extend the experimental protocol by incorporating MS4 and even MS5 spectra to score putative MS3 reconstructions against all MS4/MS5 spectra. Finally, we describe the NRP-Assembly approach that assembles MS4/MS5 spectra and further integrates the resulting contig with all non-assembled spectra. The choice of a particular approach for analyzing NRPs depends on the specifics of the peptide, its fragmentation properties, accuracy of the mass spectrometer, etc. In the remainder of the paper, we do not distinguish between MS4 and MS5 spectra and refer to them as MSn spectra. We note that although multi-stage mass spectrometry recently emerged as a valuable tool for peptide identification [18, 19], this is the first report on using multi-stage mass-spectrometry for de novo peptide sequencing and sampling as many as 5 stages of mass-spectrometry (previous studies were mainly limited to 2 stages).

2 Sequencing cyclic peptides

We start by analyzing the simplest version of the NRP sequencing problem, in which the NRP is a cyclic peptide. Below we use the NRP Seglitide, a somatostatin receptor antagonist, as an illustration. Seglitide is more potent than somatostatin for inhibition of insulin, glucagon and growth hormone release, and it is used experimentally in the treatment of Alzheimer’s disease. The structure of Seglitide is Cyclic(N-methyl-Ala-Tyr-D-Trp-Lys-Val-Phe) (see footnote 3). For a cyclic peptide P = p1 ... pn, MS2 fragmentation results in n possible linear peptides Pi = pi ... pn p1 ... pi−1 with the same parent mass (Fig. 1). The mixture of these peptides is further subjected to the next stage of mass spectrometry (MS3), resulting in the difficult problem of interpreting an MS3 spectrum of n different (but related) peptides. The theoretical MS3 spectrum Spectrum(P) of the cyclic peptide P is thus the superposition of the theoretical spectra Spectrum(Pi) of the linear peptides Pi, as shown in Figure 1. Therefore, reconstructing the circular peptide P from its theoretical spectrum Spectrum(P) amounts to the circular version of the classical Partial Digest Problem (PDP) [20]. While the complexity status of linear PDP remains unknown (a pseudo-polynomial algorithm for PDP is described in [21]), a simple branch-and-bound algorithm works well in practice [20]. However, it appears that the circular version of PDP may be harder than its linear version; in particular, the Rosenblatt-Seymour pseudo-polynomial algorithm for linear PDP [21] does not generalize to circular PDP.

Footnote 3. We remark that tandem mass-spectrometry (MS2) amounts to simply breaking (linearizing) the cyclic peptide and does not generate any useful information.

While reconstructing the cyclic peptide P from its theoretical Spectrum(P) is already a hard problem, reconstructing P from its experimental MS3 spectrum S is much more difficult. In practice, the contributions of different linear versions of P to the experimental spectrum are highly non-uniform. For example, if a certain bond (e.g., before pi) has a low propensity for breakage in the mass spectrometer, the spectrum of Pi may not contribute any peaks to the MS3 spectrum S. Such missing peaks combined with many noise peaks make the reconstruction very hard (the PDP problem is known to be NP-hard for noisy inputs [22]) and lead to the following Cyclic Peptide Sequencing Problem, which is similar to the NP-hard problem of peptide sequencing in the presence of internal ions [23].


Fig. 1. Analysis of the cyclic peptide Seglitide. a) The circular structure of Seglitide is schematically illustrated with each residue represented by a different color (slice sizes not scaled to corresponding masses of the residues). A+14 denotes a non-standard residue with integer mass 71+14=85 Da. b) MS2 fragmentation of Seglitide generates up to 6 linear peptides representing different rotated variants of the same cyclic peptide. c) Theoretical spectrum for Seglitide by superposition of the fragment masses of the linearized peptides. For simplicity, only prefix masses (b-ions) are shown here. d) Experimental spectrum of Seglitide resulting from a mixture of 6 linear peptides (the peaks corresponding to prefix ions are shown in red).
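The superposition described in panels b) and c) can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: residues are given as integer masses (the Seglitide values 85, 163, 186, 128, 99 and 147 quoted later in the text), only prefix masses are generated, and ion-type offsets are ignored. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Set

# Integer residue masses of Seglitide in the order A+14, Y, W, K, V, F.
SEGLITIDE: List[int] = [85, 163, 186, 128, 99, 147]


def rotations(residues: List[int]) -> List[List[int]]:
    """Return the n linearizations P_i = p_i ... p_n p_1 ... p_(i-1)."""
    return [residues[i:] + residues[:i] for i in range(len(residues))]


def prefix_masses(linear: List[int]) -> List[int]:
    """Masses of all proper prefixes of a linear peptide (no ion offsets)."""
    masses, total = [], 0
    for mass in linear[:-1]:  # skip the full peptide (the parent mass)
        total += mass
        masses.append(total)
    return masses


def theoretical_spectrum(residues: List[int]) -> Set[int]:
    """Superpose the prefix masses of every linearized rotation."""
    spectrum: Set[int] = set()
    for linear in rotations(residues):
        spectrum.update(prefix_masses(linear))
    return spectrum


if __name__ == "__main__":
    print(sorted(theoretical_spectrum(SEGLITIDE)))
```

Because every prefix of one rotation is the complement of a prefix of another rotation, the resulting set is symmetric about the parent mass, which is one reason prefix and suffix ladders are hard to tell apart for cyclic peptides.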

Cyclic Peptide Sequencing Problem (CPSP). Given an experimental MS3 spectrum S, find a cyclic peptide P maximizing the number of shared masses between S and the theoretical spectrum of P.

Since the branch-and-bound approach to solving CPSP is prohibitively time-consuming, we describe some alignment-based heuristics that take advantage of the specifics of the particular CPSP instances arising in NRP studies.

Sequencing cyclic peptides using MS3 spectra. Pevzner et al., 2000 [24] introduced spectral convolution and spectral alignment for revealing similarities between related but different spectra. We argue that since an experimental MS3 spectrum of a cyclic peptide is a superposition of multiple spectra of linearized peptides, spectral auto-convolution and auto-alignment should reveal key features (e.g., amino acid composition and true peaks) for the identification of the peptide.
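To make the CPSP objective concrete, the sketch below scores a candidate cyclic peptide by the number of masses it shares with a spectrum and then searches exhaustively over orderings of a known residue composition. It is a toy illustration, not the heuristic developed in this paper: it assumes exact integer masses, prefix-only theoretical spectra, and a composition known in advance, and all function names are hypothetical. The exhaustive search is included only to show why such enumeration does not scale.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Set, Tuple


def theoretical_spectrum(residues: Sequence[int]) -> Set[int]:
    """Prefix masses of every linearized rotation of a cyclic peptide."""
    n = len(residues)
    spectrum: Set[int] = set()
    for start in range(n):
        total = 0
        for step in range(n - 1):  # proper prefixes only
            total += residues[(start + step) % n]
            spectrum.add(total)
    return spectrum


def shared_masses(experimental: Iterable[int], residues: Sequence[int]) -> int:
    """CPSP score: masses common to the spectrum and the candidate."""
    return len(set(experimental) & theoretical_spectrum(residues))


def best_cyclic_order(
    experimental: Iterable[int], composition: Sequence[int]
) -> Tuple[List[int], int]:
    """Try every cyclic ordering of the composition; keep the first best."""
    peaks = set(experimental)
    first, rest = composition[0], composition[1:]
    best: List[int] = list(composition)
    best_score = -1
    # Fixing the first residue removes equivalent rotations of one cycle.
    for perm in permutations(rest):
        candidate = [first, *perm]
        score = shared_masses(peaks, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

For a peptide of length n this loop visits (n−1)! orderings even when the composition is known, and the search space is far larger when the residue masses themselves are unknown, which motivates the alignment-based heuristics described next.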


The spectral convolution between spectra S and S' is defined as the number of masses s in S such that s − x is also a mass in S' (for every parameter x) (see footnote 4). Also, the cyclic convolution Conv(S, S', x) of spectra S and S' is defined as the number of masses s in S such that either (s − x) or (s − x) + ParentMass(S) is also a mass in S'. The auto-convolution Conv(S, x) of a spectrum S is simply the cyclic convolution of S with itself.

Figure 2c presents the auto-convolution of the MS3 spectrum for Seglitide, a 6 amino acid long cyclic peptide (A+14)YWKVF (integer residue masses are 85, 163, 186, 128, 99 and 147, respectively). As expected, peaks of the auto-convolution reveal neutral losses (e.g., the peak at 18 corresponds to H2O losses). However, in the case of cyclic peptides, there are many other high-scoring peaks. For example, the largest peak Conv(S, 85) = 14 corresponds to the mass of amino acid A+14 (auto-alignment of the spectrum S with offset A+14 reveals many aligned peaks). Other amino acids in Seglitide also correspond to high peaks: Conv(S, 163) = 10, Conv(S, 186) = 8, Conv(S, 128) = 8, Conv(S, 99) = 8, and Conv(S, 147) = 8. We remark that in the interval between 50 and 200 Da there are only 4 other peaks with Conv(S, x) ≥ 8 (at offsets 78, 81, 103 and 191), indicating that spectral convolution can be used to derive the set of amino acid masses present in the circular peptide.

The auto-alignment of the spectrum S with offset x is defined as the set of peaks {s : (s − x) ∈ S}. We view auto-alignment (denoted Sx) as a virtual spectrum with parent mass equal to ParentMass(S) − x. The auto-alignment of Seglitide’s MS3 spectrum with offset 85 Da (the maximum peak revealed by spectral convolution) corresponds to the alignment between (A+14)YWKVF and YWKVF(A+14).
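The definitions of cyclic convolution, auto-convolution and auto-alignment translate directly into code. The following Python sketch assumes spectra are collections of exact integer masses (real spectra would need a mass tolerance) and takes the parent mass as an explicit argument; the function names are hypothetical.

```python
from typing import Iterable, List


def cyclic_convolution(
    s: Iterable[int], s_prime: Iterable[int], x: int, parent_mass: int
) -> int:
    """Count masses m in s with (m - x) or (m - x) + parent_mass in s_prime."""
    targets = set(s_prime)
    return sum(
        1
        for m in set(s)
        if (m - x) in targets or (m - x + parent_mass) in targets
    )


def auto_convolution(s: Iterable[int], x: int, parent_mass: int) -> int:
    """Cyclic convolution of a spectrum with itself."""
    peaks = set(s)
    return cyclic_convolution(peaks, peaks, x, parent_mass)


def auto_alignment(s: Iterable[int], x: int) -> List[int]:
    """Virtual spectrum {m : (m - x) is also a peak}, sorted by mass."""
    peaks = set(s)
    return sorted(m for m in peaks if (m - x) in peaks)


if __name__ == "__main__":
    # Toy spectrum (not experimental data) with parent mass 400.
    toy = [50, 100, 185, 300, 365]
    for offset in (85, 115):
        print(offset, auto_convolution(toy, offset, 400), auto_alignment(toy, offset))
```

Scanning `auto_convolution` over a range of offsets and keeping the highest-scoring values is one simple way to propose candidate residue masses, in the spirit of the Seglitide example above.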
Similarly to the spectral alignment of spectra from different peptides [25], one would expect auto-alignment to mostly reflect either prefix or suffix ion fragments from the linearized peptides (A+14)YWKVF and YWKVF(A+14) (with the number of noisy peaks greatly reduced). The separation of prefix (e.g., b-ions) and suffix (e.g., y-ions) ladders by spectral alignment is important since it significantly simplifies spectral interpretation and enables accurate de novo peptide sequencing [26, 27]. However, it turns out that interpretation of auto-alignments (of cyclic peptides) is more complex than interpretation of spectral alignments of different (linear) peptides. While auto-alignment reduces the noise, it does not separate prefix and suffix ladders, i.e., auto-alignment contains both prefix and suffix ladders. This is caused by the fact that the MS3 spectrum contains peaks from both (A+14)YWKVF and YWKVF(A+14). Thus, the b-ions from (A+14)YWKVF match the b-ions of YWKVF(A+14) and, moreover, the y-ions from YWKVF(A+14) match the y-ions of (A+14)YWKVF (with the same offset 85 for both b- and y-ions). When the set of possible amino acid masses is known in advance (as in traditional peptide sequencing), one can interpret the auto-alignment Sx of the MS3 spectrum S using either the anti-symmetric path approach [28] or the spectral

Footnote 4. While the standard spectral convolution simply counts the number of peaks separated by mass x, in the case of scored spectra (represented as vectors $S = (s_1 \ldots s_n)$ and $S' = (s'_1 \ldots s'_n)$ reflecting peak intensities or other characteristics) the spectral convolution is defined as $\sum_{i=1}^{n} s_i \cdot s'_{i-x}$.

[Figure residue. Legible panel text: "Starting points for all possible linearized peptides from the cyclic peptide Seglitide"; "Alignment reveals the common masses between overlapping linearized peptides"; legend: "Point where the convolution becomes circular; portions beyond this line are matched from the line marked with" (remainder not recoverable).]