Charge State Estimation for Tandem Mass Spectrometry Proteomics

5847_03_p233-250

9/22/05

1:05 PM

Page 233

OMICS A Journal of Integrative Biology Volume 9, Number 3, 2005 © Mary Ann Liebert, Inc.

*JASON M. HOGAN, *ROGER HIGDON, NATALI KOLKER, and EUGENE KOLKER

ABSTRACT

High-throughput protein analysis by tandem mass spectrometry produces anywhere from thousands to millions of spectra that are used for peptide and protein identifications. Although each spectrum corresponds to only one charged peptide (ion) state, repetitive database searches at multiple charge states are typically conducted because the resolution of many common mass spectrometers is not sufficient to determine the charge state. The resulting database searches are both error-prone and time-consuming. We describe a straightforward, accurate approach to charge state estimation (CHASTE). CHASTE relies on fragment ion peak distributions and uses logistic regression models to combine different measurements and improve its accuracy. CHASTE’s performance has been validated on data sets of known peptide dissociation spectra, obtained by replicate analyses of our previously developed protein standard mixture using ion trap mass spectrometers at different laboratories. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90%, virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high-throughput peptide and protein identifications. Thresholds and parameter estimates can be tailored to specific analysis situations, pipelines, and instrumentation. CHASTE was implemented in Java with GUI-based and command-line-based interfaces.

INTRODUCTION

Mass spectrometry has recently become the most common approach for high-throughput protein sample analysis (Aebersold and Goodlett, 2001; Kolker et al., 2003; Leibler, 2002; Washburn et al., 2001). Complex protein mixtures are typically converted into peptides through enzymatic or chemical cleavage. The most commonly used enzyme for protein digestion is trypsin, as the resulting peptides are often smaller and more amenable to positive ion analysis. The resulting peptides are usually subjected to gel electrophoresis or chromatographic separation to reduce the mixture complexity. The separated peptides are ionized by electrospray (Fenn et al., 1989), nanospray (Van Berkel et al., 2001), or matrix-assisted laser desorption ionization (Karas and Hillenkamp, 1988) and introduced into a mass spectrometer. Peptide analyses have been performed on a variety of instruments, including ion trap, triple quadrupole, quadrupole time-of-flight, tandem time-of-flight, and Fourier transform ion cyclotron resonance instruments. Tandem mass spectrometry (MS/MS) is used to obtain peptide sequence information by interpretation of the resulting fragment ion spectrum.

Peptide and protein identifications are typically accomplished via database searching of uninterpreted peptide MS/MS spectra. The two most commonly used programs are SEQUEST (Eng et al., 1994; MacCoss et al., 2002) and Mascot (Perkins et al., 1999). These algorithms compare the experimental peptide fragmentation spectrum to theoretical peptide product ion spectra created in silico from amino acid sequences contained in a protein sequence database. As protein sequence databases continue to expand, the increasing number of protein sequences makes it harder for peptide identification software to correctly relate an MS/MS spectrum to a peptide sequence present in the protein database. The use of electrospray or nanospray ionization adds to this challenge because these methods produce multiply charged ions. Multiply charged ions typically provide more peptide sequence information than singly charged ions, allowing better peptide identification, especially when searching for post-translational modifications. The drawback of MS/MS spectra of multiply charged ions is that the fragmentation spectra are more complex. Though high-end instruments can observe the isotopic spacing of precursor and fragment ions to interpret their charge state, most commonly used ion trap and triple quadrupole instruments generally have inadequate resolution to do so. To overcome this limitation, database searches are typically conducted assuming all possible charge states of the precursor ion, and the search results are then interpreted to determine the charge state. This approach increases the overall time of the database search, as well as the possibility of false positive identifications due to the increased number of possible peptide candidates.

BIATECH, Bothell, Washington. *Authors contributed equally to this study and should be considered first authors.
Knowledge of the precursor ion charge state is not only needed by database search algorithms; it is a necessary component of de novo sequencing methods as well (Dancik et al., 1999). Many proteomics laboratories have implemented straightforward approaches to differentiate +1 charged spectra from multiply charged ones (Tabb et al., 2001), and a few have implemented approaches to discriminate +2 and +3 spectra (Colinge et al., 2003; Dancik et al., 1999; Sadygov et al., 2002). These are largely based on identifying complementary fragment pairs present in the MS/MS spectra under the assumed charge state and on comparing the distribution of fragment peaks relative to the precursor mass-to-charge ratio. However, very little has been published about the accuracy and reliability of such approaches. Recently, a support vector machine classifier was developed to differentiate MS/MS spectra of doubly and triply charged peptides while maintaining almost all true positive peptide identifications (Klammer et al., 2005). The support vector machine utilizes 34 different features to predict the peptide charge state. Whatever approach is used for charge state determination, it is unlikely that it will apply universally to diverse types of protocols, pipelines, and instrumentation. Therefore, there is a clear need for an accurate, generalized, and statistically solid approach that can be tailored and validated for each specific analysis situation. We herein present such an approach for charge state estimation (CHASTE) for peptide spectra with +1, +2, or +3 charge states. CHASTE relies on fragment ion peak distributions and uses logistic regression models to combine different measures and improve its accuracy. CHASTE’s performance has been validated on data sets obtained on two ion trap mass spectrometers (the LCQ and LTQ from Thermo Electron).
These data sets were obtained at three different labs by analyzing our previously developed protein standards (Purvine et al., 2004b) with the most commonly used database search program, SEQUEST. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90%, virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high-throughput peptide and protein identifications. CHASTE’s thresholds and parameter estimates can be easily tailored to specific analysis situations, pipelines, and instruments. CHASTE was implemented in Java with GUI-based and command-line-based interfaces that are available free of charge for research purposes.

MATERIALS AND METHODS

Creation of data sets

A standard mixture of peptides from stand-alone and protein digest sources was described in our recent study (Purvine et al., 2004b). This mixture was analyzed multiple times via liquid chromatography tandem mass spectrometry (LC/MS/MS). The mixture comprises 23 individually characterized tryptic peptides and 12 proteins, all commercially available from Sigma (St. Louis, MO). After LC/MS/MS analysis of the individual protein digests, the proteins were mixed at concentrations described previously (Purvine et al., 2004b), reduced and alkylated, and digested with modified porcine trypsin (V511A, Promega, Madison, WI). These mixtures were combined with two respective mixtures, equimolar and normalized (to yield similar peak intensities), of the 23 peptides. This final mixture, comprising approximately 250 tryptic peptides, was analyzed as described in detail in Kolker et al. (2005) and Purvine et al. (2004b) at two labs, BIATECH and Pacific Northwest National Laboratory (PNNL), using an LCQ DecaXP Plus ion trap mass spectrometer, and at Thermo Electron using an LTQ ion trap mass spectrometer (Thermo Electron Corp., San Jose, CA).

The MS/MS data from the mixture analyses were then analyzed using the SEQUEST Browser (packaged with the Bioworks software tools from Thermo Electron), wherein MS/MS scans were converted to flat data files (DTA files) of assumed +1, +2, and +3 charge states. These DTA files were searched via the SEQUEST algorithm against a database of the 23 peptides and 12 proteins in our standards, known contaminants, and all proteins from Shewanella oneidensis strain MR-1 (Heidelberg et al., 2002; Kolker et al., 2005). The SEQUEST outputs were processed with our previously developed Logistic Identification of Peptide Sequences (LIPS) models (Higdon et al., 2004). LIPS models fit the logit (log odds) of the probability of correct identification (the LIPS index) as a linear combination of the predictors (McCullagh and Nelder, 1999).
These predictors currently include: cross-correlation score (Xcorr), relative difference between the first and second highest cross-correlations (ΔCn), peptide length (PL), charge state (CS), and number of tryptic termini (NTT):

$$\mathrm{LIPS} = \beta_0 + \beta_1 \cdot \mathrm{Xcorr} + \beta_2 \cdot \Delta C_n + \beta_3 \cdot \mathrm{PL} + \beta_4 \cdot \mathrm{CS} + \beta_5 \cdot \mathrm{NTT} \qquad (1)$$

where the default values for this standard base LIPS model have the following magnitudes: 7.1 for β0; 2.6 for β1; 13.4 for β2; 0.14 for β3; 1.0 for β4; and 2.9 for β5 (see Higdon et al., 2004, for the sign of each coefficient). The weights were calculated using logistic regression models that were first trained on one subset of the above protein standards and then validated on another subset, as described in detail in our previous work (Higdon et al., 2004). These LIPS indexes can then be easily converted into estimates of the probability of a correct peptide match:

$$p = \frac{e^{\mathrm{LIPS}}}{1 + e^{\mathrm{LIPS}}} \qquad (2)$$
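As a minimal sketch of equations (1) and (2), the following Python converts a LIPS index into a probability. The function names, the weights, and the predictor values are illustrative assumptions; the fitted coefficients belong to Higdon et al. (2004) and are deliberately not hard-coded here.

```python
import math


def lips_index(intercept, weights, predictors):
    """Eq. (1): intercept plus the weighted sum of the predictors."""
    return intercept + sum(weights[name] * predictors[name] for name in weights)


def lips_probability(index):
    """Eq. (2): p = e^LIPS / (1 + e^LIPS), evaluated without overflow."""
    if index >= 0:
        return 1.0 / (1.0 + math.exp(-index))
    z = math.exp(index)
    return z / (1.0 + z)


# Hypothetical weights and predictor values, for illustration only.
weights = {"Xcorr": 2.0, "dCn": 10.0}
index = lips_index(-5.0, weights, {"Xcorr": 3.0, "dCn": 0.2})
print(round(lips_probability(index), 3))
```

Because the index is a log odds, a probability of 0.9 corresponds to a LIPS index of ln 9, roughly 2.2.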

Determination of charge state +1

Theoretically, the MS/MS spectrum of a peptide with a +1 charge state should contain no fragment ion peaks with an m/z greater than the precursor m/z value. Conversely, spectra of peptides with charge state +2 or higher should contain fragment ion peaks at m/z greater than the precursor m/z. Ideally, one should be able to determine the +1 charge state simply by checking for the presence of peaks above the precursor m/z. However, MS/MS spectra generally contain a number of noise peaks. As part of our previously described SPEQUAL approach (Purvine et al., 2004a), a charge state +1 test was conducted by filtering out all peaks below a given intensity (SPEQUAL’s default value is 5% of the maximum intensity peak). If the number of remaining peaks at m/z greater than the precursor m/z is less than a given threshold (default value of two peaks), the peptide is assigned the +1 charge state. Otherwise, the peptide is assumed to be charge state +2 or higher. In addition to this method, CHASTE also calculates the percent of all peak intensities at m/z greater than the precursor (regardless of relative intensity).
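The two +1 measures described above can be sketched as follows. This is an illustration under stated assumptions, not the CHASTE or SPEQUAL source: the function names and the representation of a spectrum as a list of (m/z, intensity) pairs are hypothetical, and the defaults are the values quoted in the text.

```python
def is_plus_one_by_peak_count(peaks, precursor_mz,
                              rel_intensity_cutoff=0.05, peak_threshold=2):
    """Peak-count test: drop peaks below 5% of the most intense peak, then
    call the spectrum +1 if fewer than `peak_threshold` of the remaining
    peaks lie above the precursor m/z."""
    floor = rel_intensity_cutoff * max(intensity for _, intensity in peaks)
    n_above = sum(1 for mz, intensity in peaks
                  if intensity >= floor and mz > precursor_mz)
    return n_above < peak_threshold


def fraction_intensity_above(peaks, mz_limit):
    """Share of the total ion intensity found at m/z greater than `mz_limit`,
    regardless of relative intensity."""
    total = sum(intensity for _, intensity in peaks)
    if total <= 0:
        return 0.0
    return sum(intensity for mz, intensity in peaks if mz > mz_limit) / total


# Made-up spectra: only weak noise above the precursor vs. strong peaks above it.
singly = [(200.0, 100.0), (350.0, 80.0), (520.0, 3.0), (610.0, 2.0)]
multiply = [(300.0, 50.0), (450.0, 40.0), (700.0, 100.0), (900.0, 60.0)]
print(is_plus_one_by_peak_count(singly, 500.0), fraction_intensity_above(singly, 500.0))
print(is_plus_one_by_peak_count(multiply, 500.0), fraction_intensity_above(multiply, 500.0))
```

The intensity fraction can then be compared against a percentage threshold, which keeps the test insensitive to how many small noise peaks a spectrum happens to contain.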

Discrimination of charge states +2 and +3

Tandem mass spectrometry of multiply charged peptide ions produces complementary sets of fragment ions, so the total mass and charge of the precursor ion are conserved. Provided both fragment ions are within the mass range of the instrument, complementary sets of singly charged ions will typically be obtained from doubly charged peptide ions. Accordingly, triply charged peptide ions will produce complementary fragment ion sets containing one singly and one doubly charged fragment ion. (The only exception to this rule involves neutral losses, where the total charge of the precursor ion is retained on one of the fragment ions and the neutral fragment is not observed.)

To discriminate between +2 and +3 ions (+2/+3 tests), CHASTE implements the following procedure. First, the fragment ion peaks are sorted by abundance, and the top peaks are retained in the fragment list (CHASTE’s default value is 100). Second, any isotopic peaks within a short m/z range surrounding each fragment ion (CHASTE’s default values are 2.5 m/z above and 2.0 m/z below each fragment ion) are removed from this fragment list, starting with the most abundant fragment ion. Third, the subset of the most abundant fragment ion peaks (CHASTE’s default value is 15) is used as the search list. Next, in the case of spectra assumed to be from doubly charged peptides, each m/z value in this search list is compared to each fragment ion in the fragment list. These comparisons look for pairs that sum to the singly charged precursor ion mass plus a proton, within a reasonable mass tolerance (CHASTE’s default value is 3 Da). Each matching pair of complementary ions is then counted. (Each m/z value in the search list may only be counted once, and redundant matches are discarded.) Doubly charged fragment ions have no observable complementary ion and will only give rise to random matches.

In contrast, MS/MS spectra of triply charged peptides can produce fragment ions that are singly, doubly, and triply charged. Since the charge state of each m/z value in the search list is unknown, it must be searched as if it were both a singly and a doubly charged fragment ion. This results in two independent searches. After adjusting the m/z values in the search list for the assumed charge state, each m/z value is again compared to each fragment ion in the fragment list, as in the doubly charged case. Each m/z value in the search list can only be counted once per search, but an m/z value will be counted twice if it matches a complementary ion that is singly charged and a separate complementary ion that is doubly charged. Triply charged fragment ions, in this case, have no observable complementary ions and will also give rise only to random matches, as in the doubly charged case. Finally, after counting all the matched complementary ion pairs for both the doubly and triply charged cases, the difference between the two values (+2 matches minus +3 matches) is used as a predictor for determining the precursor peptide charge state.

In addition to the number of matching complementary ion pairs, the distribution of peaks in relation to the precursor m/z also contains information about the precursor charge state. For the same reason as described for the MS/MS spectra of charge state +1, spectra at charge state +2 should contain no fragment ion peaks above twice the precursor m/z. Additionally, spectra at charge state +3 should contain a greater proportion of fragment ion peaks at m/z greater than the precursor m/z. To measure these, CHASTE uses the percent of all peak intensities at m/z greater than the precursor m/z (as was used to test for charge state +1) and the percent of all peak intensities at m/z greater than twice the precursor m/z (regardless of relative intensity). These measures are combined with the difference in the number of complementary fragment ion matches between charge states +2 and +3 described above by using logistic regression models (McCullagh and Nelder, 1999), similar to our LIPS model, to create a compound predictor of charge state used in CHASTE (+2/+3 tests).
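The complementary-ion counting procedure can be sketched as below. This is a simplified illustration, not CHASTE's Java implementation: the function names and the (m/z, intensity) peak representation are assumptions, each search peak is paired with its closest complement inside the tolerance, and a peak is never paired with itself. The sketch relies on the identity that, for a precursor of charge z, the charge-weighted m/z values of a complementary pair sum to z times the precursor m/z.

```python
def prepare_fragments(peaks, top_n=100, iso_below=2.0, iso_above=2.5):
    """Keep the top_n most intense peaks, then drop any peak lying inside the
    isotope window of a more intense peak that was already kept.
    Returns m/z values ordered from most to least intense."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    kept = []
    for mz, _ in ranked:
        if all(not (k - iso_below <= mz <= k + iso_above) for k in kept):
            kept.append(mz)
    return kept


def count_pairs(fragments, precursor_mz, charge, n_search=15, tol=3.0):
    """Count distinct complementary pairs under an assumed charge of 2 or 3."""
    total = charge * precursor_mz  # charge-weighted m/z sum of a complementary pair
    # (charge of search ion, charge of complement) combinations to try
    splits = {2: [(1, 1)], 3: [(1, 2), (2, 1)]}[charge]
    pairs = set()
    for z_s, z_f in splits:
        for i, mz_s in enumerate(fragments[:n_search]):
            best = None
            for j, mz_f in enumerate(fragments):
                if i == j:
                    continue
                err = abs(z_s * mz_s + z_f * mz_f - total)
                if err <= tol and (best is None or err < best[0]):
                    best = (err, j)
            if best is not None:
                # A frozenset makes a pair found from either side count once.
                pairs.add(frozenset([(i, z_s), (best[1], z_f)]))
    return len(pairs)


# Made-up doubly charged spectrum (precursor m/z 500) with three complementary pairs.
peaks = [(310.0, 100.0), (690.0, 90.0), (420.0, 80.0), (580.0, 70.0),
         (255.0, 60.0), (745.0, 50.0), (133.0, 40.0), (311.0, 30.0)]
fragments = prepare_fragments(peaks)
difference = count_pairs(fragments, 500.0, 2) - count_pairs(fragments, 500.0, 3)
print(fragments, difference)
```

Storing each pair as an unordered set is one way to discard the redundant matches mentioned in the text; other bookkeeping choices would give the same counts on this example.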

Training dataset

Data from four LC/MS/MS runs of our protein standard mixtures described above were used as a training set for determining the thresholds and for training the CHASTE logistic regression models. To obtain a set of known peptide ion charge states, only MS/MS spectra that were matched to proteins contained in our set of standard peptides and proteins and were determined to have high quality matches (high confidence peptide identifications with LIPS probabilities greater than 0.9) were retained. This results in a training set with very high certainty of correct peptide identifications and thus correct precursor peptide ion charge states. The training set contains a total of 1,250 peptide MS/MS spectra, of which 98 were charge state +1, 877 were +2, and 275 were +3.

Test datasets and validation of the charge state determinations

An additional six LC/MS/MS runs on our protein standard mixture were used as a first test dataset to validate the charge state determinations. This resulted in 7,536 MS/MS spectra leading to 22,382 DTA files at charge states +1, +2, and +3 that were searched by SEQUEST (note that 226 +3 DTA files were not created due to restrictions on the precursor mass). The same criterion as in the previous section was used to generate a set of peptides with known charge states (high confidence peptide identifications with LIPS probabilities greater than 0.9). Additionally, a second, less stringent criterion was used to increase the size of the test dataset and include lesser quality spectra. For this criterion a peptide was considered correctly assigned if the following conditions were met: (a) the peptide was matched to one of the standard proteins or peptides; (b) its length is at least 5 amino acids; and (c) the peptide cannot be matched to standard proteins or peptides at more than one charge state. Some of these peptides will inevitably be matched to the incorrect charge state due to the usual random matching of incorrect peptides. To adjust for this, the number of random matches to correct proteins or peptides was estimated by searching the spectra against a reversed sequence database (Higdon and Kolker, unpublished data).

In addition to the above dataset, two further test datasets, one obtained by PNNL (six LCQ runs) and the other by Thermo Electron (one LTQ run), were analyzed to validate CHASTE’s performance using procedures similar to the ones described above. Estimated charge states based on the +1 and +2/+3 tests described above were compared to the known charge states for these datasets. Measures of sensitivity, specificity, overall accuracy, and false positive rate as defined in Table 1 were calculated for different thresholds and versions of the +1 and +2/+3 tests.

Evaluation of predictive value of the charge state determinations

This study included an evaluation of any potential gain or loss in the ability to predict which peptides are correctly identified by SEQUEST and the LIPS standard default model when DTA files are screened out using the +1 and +2/+3 tests. Additionally, we examined the tests’ utility as an additional indicator of peptide match quality (peptide identification probability). This was done by incorporating into our LIPS models two variables indicating whether a DTA file would be kept as a result of passing the +1 test or both the +1 and +2/+3 tests. The models were fit on the training dataset (four MS/MS runs) described above. All matches to contaminants, peptides shorter than five amino acids, and peptides with multiple matches to standard proteins at different charge states were excluded. For purposes of training the models, peptide matches were considered correct if they were matched to one of the standard proteins or peptides, and not a Shewanella oneidensis protein. Models were compared to our standard LIPS model (eq. 1). Each of the above indicator variables for the charge state tests was added to this model, and the models were also fit using the standard model predictors after removal of DTA files failing the +1 test and both the +1 and +2/+3 tests, respectively. The models were validated against the first test dataset described above.

Receiver Operating Characteristic (ROC) curves were generated for each model. A ROC curve is a plot of the sensitivity versus (1 minus specificity) for all possible thresholds of fitted probabilities (eq. 2) from the LIPS models (Table 1 defines the accuracy measures used in this study). These curves show the trade-off between finding correct peptide identifications and making false positive identifications. Greater area underneath the curve is indicative of better predictive ability (Pepe, 2003).

TABLE 1. DEFINITIONS OF ACCURACY MEASURES

                        Actual classification (“gold” standard)
Test classification     Positive               Negative               Total
Positive                True positive (tp)     False positive (fp)    $N_{p\cdot}$
Negative                False negative (fn)    True negative (tn)     $N_{n\cdot}$
Total                   $N_{\cdot p}$          $N_{\cdot n}$          $N_{\cdot\cdot}$

Accuracy: $\mathrm{Acc} = (tp + tn) / N_{\cdot\cdot}$
Sensitivity: $\mathrm{Se} = tp / N_{\cdot p}$
Specificity: $\mathrm{Sp} = tn / N_{\cdot n}$
False positive rate: $\mathrm{FPR} = fp / N_{p\cdot}$
False negative rate: $\mathrm{FNR} = fn / N_{n\cdot}$

Depending on the situation, positive and negative correspond to different outcomes. For peptide identification, positive is a correct match and negative is an incorrect match; for the +1 test, positive is the +1 charge state and negative is +2 or higher; and for the +2/+3 test, positive is +2 and negative is +3. However, the latter assignment is arbitrary, so specificity for +2 is equivalent to sensitivity for +3, the false negative rate for +2 is equivalent to the false positive rate for +3, and so on.

TABLE 2. ACCURACY OF THE +1 TEST ON THE TRAINING DATASET

Test           Predicted CS   Actual +1   Actual +2/+3   Sensitivity   FP rate
20% GT PC      +1                    97              1         99.0%      1.0%
               +2/+3                  1           1151         99.9%      0.1%
               Accuracy = 99.8%
15% GT PC      +1                    94              1         95.9%      1.1%
               +2/+3                  4           1151         99.9%      0.3%
               Accuracy = 99.6%
2 peaks > 5%   +1                    96             10         98.0%      9.7%
               +2/+3                  2           1142         99.1%      1.7%
               Accuracy = 99.0%

GT PC: percentage of the total ion intensity at m/z greater than the precursor m/z.
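The accuracy measures of Table 1 and the ROC construction can be sketched as follows. The function names are illustrative, and the ROC helper is a simplified version that assumes all scores are distinct and that both classes are present. Note that the false positive rate of Table 1 divides by the number of test positives, whereas the horizontal axis of a ROC curve is 1 minus specificity.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures as defined in Table 1."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # tp / N.p
        "specificity": tn / (fp + tn),   # tn / N.n
        "fpr": fp / (tp + fp),           # fp / Np. (row total of test positives)
        "fnr": fn / (fn + tn),           # fn / Nn. (row total of test negatives)
    }


def roc_points(scores, labels):
    """(1 - specificity, sensitivity) after lowering the threshold past each
    score in turn. Labels are 1 (correct) or 0 (incorrect)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Counts of the 20% threshold +1 test on the training data (Table 2).
m = accuracy_measures(tp=97, fp=1, fn=1, tn=1151)
print({name: round(100 * value, 1) for name, value in m.items()})

# Made-up fitted probabilities and correctness labels for a tiny ROC example.
curve = roc_points([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 1, 0])
print(area_under_curve(curve))
```

With the Table 2 counts this yields the tabulated 99.8% accuracy, 99.0% sensitivity, and 1.0% false positive rate for the +1 call.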

RESULTS

Determination of initial thresholds

To differentiate +1 from +2/+3 MS/MS spectra, a minimum percentage intensity threshold, based on the percentage of the total ion intensity contained in fragment ion peaks with m/z values greater than the precursor m/z, was utilized. The use of a minimum threshold of 20% resulted in the highest accuracy (99.8%) for determination of +1 charge state spectra based on the training set data (Table 2). Using a threshold of 15% resulted in slightly worse accuracy (99.6%), but this threshold was included because 20% was a surprisingly large threshold and because the test based on the number of peaks (threshold of 2 peaks) above 5% relative intensity resulted in lower accuracy (99.0%).

In the +2/+3 tests, to make the distribution of the difference in complementary matches between +2 and +3 charge state peptides symmetric around 0 (+2 mean and +3 mean), 5 was added to the number of matches assuming a +2 charge state. This is necessary since matches at the +3 charge state can occur with both singly and doubly charged fragments, compared to only singly charged fragments at the +2 charge state. As seen in Table 3, if peptides with a difference (+2 matches plus 5 minus +3 matches) greater than or equal to 3 are assigned to charge state +2 and peptides with a difference less than or equal to −4 are assigned to +3, the resulting error rates for +2 and +3 assignments are less than 1% (0.4% for +2 and 0.6% for +3). Peptides with differences in between these thresholds would be considered ambiguous and would be searched at both charge states.

TABLE 3. ACCURACY OF THE +2/+3 TEST ON THE TRAINING DATASET^a

Test                 Predicted CS   Actual +2   Actual +3   Sensitivity   FP rate
+2/+3 test           +2                   754           3         86.1%      0.4%
                     +2/+3                121         103
                     +3                     1         170         61.6%      0.6%
                     Not assigned CS = 19.4%; Accuracy = 99.5%
Logistic +2/+3 test  +2                   805           3         90.1%      0.4%
                     +2/+3                 79          70
                     +3                     1         194         72.7%      0.5%
                     Not assigned CS = 12.9%; Accuracy = 99.6%

^a Note: the accuracy value excludes those peptides assigned to +2/+3, but these values are included when calculating sensitivity.

Figures 1–3 demonstrate the three possible scenarios for precursor charge state selection based on the difference in complementary ion matches between the +2 and +3 charge states. Figure 1 shows an MS/MS spectrum of a doubly charged peptide with an m/z of 875.72. Assuming the dissociated peptide ion was doubly charged, the total number of matched complementary ions was 9, while assuming the peptide was triply charged, the total number of matched complementary ions was only 7. The difference in complementary ions assigns this MS/MS spectrum to a doubly charged precursor ((9 + 5) − 7 = 7). The most abundant fragment ion in the spectrum was determined to result from the loss of a neutral water molecule, after manual inspection of the m/z difference of 9 between the fragment ion and the dissociated precursor ion (half the 18 Da mass of water, as expected for a doubly charged ion). This fragment ion is not matched to any complementary ion because it is doubly charged. A case demonstrating a triply charged precursor ion with an m/z of 547.98 is shown in Figure 2. This spectrum contains a larger number of matched complementary ions related to a triply charged precursor (14) than to a doubly charged precursor (4). Figure 3 shows the MS/MS spectrum of a precursor ion with an m/z of 432.11. This spectrum demonstrates a case where the difference between the numbers of complementary fragment ions assigned to a doubly and a triply charged precursor ion falls within the decision thresholds. As a result, both the +2 and +3 charge states would be searched using SEQUEST and estimated by the LIPS model.

FIG. 1. MS/MS spectrum of a doubly charged peptide with an m/z of 875.72. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state is correctly identified as +2 because the number of complementary ion pairs matching the doubly charged peptide is above the decision threshold.

FIG. 2. MS/MS spectrum of a triply charged peptide with an m/z of 547.98. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state is correctly identified as +3 because the number of complementary ion pairs matching the triply charged peptide is above the decision threshold.

FIG. 3. MS/MS spectrum of a triply charged peptide with an m/z of 432.11. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state cannot be determined because the number of complementary ion pairs falls within the decision thresholds.

The difference in complementary ion matches was combined with the percentage of fragment ion intensity greater than the precursor m/z and greater than twice the precursor m/z, using a logistic regression model fit to the training data. This approach is similar to the approach described in equations (1) and (2) for the LIPS models. Each of these predictors significantly improved the fit to the training data after accounting for the others (p-values all less than 0.001). The resulting model is:

$$L_{CS} = -8.88 + 14.19 \cdot X_1 + 29.38 \cdot X_2 - 0.99 \cdot X_3 \qquad (3)$$

and

$$p_{CS} = \frac{e^{L_{CS}}}{1 + e^{L_{CS}}} \qquad (4)$$

where X1 = % intensity greater than precursor m/z, X2 = % intensity greater than twice precursor m/z, X3 = ((+2 matches + 5) − +3 matches), and pCS is the estimated probability of the +3 charge state. Thresholds were chosen to match the error rates of the difference in complementary matches (pCS ≥ 0.96 is the +3 charge state, pCS ≤ 0.06 is the +2 charge state, and otherwise the charge state is not assigned). As seen in Table 3, the logistic model increased sensitivity for +2 and +3 identification and thereby reduced the percentage of unassigned charge states from 19.4% to 12.9%. Examples of this model can be seen in Figures 1–3, where the +2 spectrum (Fig. 1) has a fitted probability pCS near 0, the +3 spectrum (Fig. 2) has a probability near 1, and the last spectrum (Fig. 3) has an ambiguous pCS value of 0.24.
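The two +2/+3 decision rules can be sketched as follows. The function names are illustrative. The coefficient magnitudes are those of equation (3), but their signs are an assumption inferred from the predictor definitions (pCS is the probability of +3, so intensity above the precursor should raise it and an excess of +2 matches should lower it). Treating X1 and X2 as fractions between 0 and 1, and the inclusive threshold comparisons, are likewise assumptions.

```python
import math


def assign_by_difference(matches_plus2, matches_plus3,
                         offset=5, plus2_min=3, plus3_max=-4):
    """Complementary-match rule: returns 2, 3, or None (search both states)."""
    difference = (matches_plus2 + offset) - matches_plus3
    if difference >= plus2_min:
        return 2
    if difference <= plus3_max:
        return 3
    return None


def probability_plus3(x1, x2, x3):
    """Eqs. (3) and (4); coefficient signs are assumed, see the note above."""
    l_cs = -8.88 + 14.19 * x1 + 29.38 * x2 - 0.99 * x3
    return 1.0 / (1.0 + math.exp(-l_cs))


def assign_by_logistic(x1, x2, x3, plus3_min=0.96, plus2_max=0.06):
    """Logistic rule: returns 2, 3, or None (charge state not assigned)."""
    p = probability_plus3(x1, x2, x3)
    if p >= plus3_min:
        return 3
    if p <= plus2_max:
        return 2
    return None


# Match counts quoted in the text for Figures 1 and 2.
print(assign_by_difference(9, 7))    # Fig. 1: (9 + 5) - 7 = 7
print(assign_by_difference(4, 14))   # Fig. 2: (4 + 5) - 14 = -5
# Hypothetical predictor values for the logistic rule.
print(assign_by_logistic(0.10, 0.00, 7))
print(assign_by_logistic(0.60, 0.10, -5))
```

Returning None for the ambiguous band mirrors the text: such spectra are searched at both charge states rather than forced into one.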

Validation of charge state tests Table 4 shows the results of the different 1 tests obtained on our first test dataset. A minimum percentage intensity threshold of 15% of the total ion intensity performed slightly better than using either a 20% minimum intensity threshold or a threshold requiring no more than one fragment ion peak above the precursor m/z to have an intensity greater than 5% relative abundance, when using the same standard criterion as in the training dataset. Lessening the standard criterion for known charge states to include lower quality matches to the known protein or peptide standards appears reduce the accuracy of the 1 tests. Accuracy declines from 99.3% to 96.9% using the 15% threshold. However, part of this decline is due to random peptide matches of known standard proteins or peptides searched for an incorrect charge state. To es241

5847_03_p233-250

9/22/05

1:05 PM

Page 242

TABLE 4.

ACCURACY

OF THE

1 TEST

ON THE

Matched to Protein or Peptide Standard and LIPS Probability  0.9 ActualCS Predicted CS 1 2/3 20% GT PC

1 2/3

120 2

FIRST TEST DATASET Sensitivity, FP rate and Accuracy

1 Sens  98.1% 1 FP rate  9.1% 2/3 Sens  2/3 FP rate  0.2% 99.1%

12 1328

Accuracy  99.0% 15% GT PC

1 2/3

94 4

1 Sens 95.9% 2/3 Sens 99.9%

1 1151

1 FP rate  1.1%  2/3 FP rate  0.3%

Accuracy  99.6% 2 peaks  5%

1 2/3

96 2

1 Sens  98.0% 2/3 Sens 99.1%

10 1142

1 FP rate  9.7%  2/3 FP rate  1.7%

Accuracy  99.0% Matched to Protein or Peptide Standard Only Actual CS Predicted CS 1 2/3 20% GT PC

1 2/3

417 27

Sensitivity, FP rate and Accuracy 1 Sens  93.9% 1 FP rate  9.2% 2/3 Sens  2/3 FP rate  1.6% 97.5%

42 1672

Accuracy  96.8% 15% GT PC

1 2/3

411 33

1 Sens  92.6% 1 FP rate  7.4% 2/3 Sens  2/3 FP rate  1.9% 98.1%

33 1681

Accuracy  96.9% 2 peaks  5%

1 2/3

408 36

1 Sens  91.7% 1 FP rate  8.3% 2/3 Sens  2/3 FP rate  1.5% 98.4%

37 1677

Accuracy  96.6% Matched to Protein or Peptide Standard Adjusted for Random Matches Actual CS Predicted CS 1 2/3 20% GT PC

1 2/3

430 7

Sensitivity, FP rate and Accuracy

1 Sens  98.4% 1 FP rate  6.3% 2/3 Sens  2/3 FP rate  0.4% 98.3%

29 1692

Accuracy  98.3% 15% GT PC

1 2/3

424 13

1 Sens  97.0% 1 FP rate  4.5% 2/3 Sens  2/3 FP rate  0.8% 98.8%

20 1701

Accuracy  98.5% 2 peaks  5%

1 2/3

421 16

1 Sens  96.3% 1 FP rate  5.4% 2/3 Sens  2/3 FP rate  0.9% 98.6%

24 1697

Accuracy 98.1%

242

5847_03_p233-250

9/22/05

1:05 PM

Page 243

CHASTE FOR TANDEM MASS SPECTROMETRY PROTEOMICS TABLE 5.

ACCURACY

OF THE

2/3 TEST

ON THE

Matched to Protein or Peptide Standard and LIPS Probability  0.9 Actual CS Predicted CS 2 3 2/3 Test

2 2/3 3

957 159 5

FIRST TEST DATASETa Sensitivity, FP rate and Accuracy 2 Sens  85.3% 2 FP rate  0.1% Not assigned CS  14.2% 3 Sens  74.3% 3 FP rate  3.1%

1 53 153

Accuracy  99.0% Logistic 2/3 Test

2 2/3 3

1029 102 2

2 Sens  90.8% 2 FP rate  0.2% Not assigned CS  11.1% 3 Sens  76.3% 3 FP rate  1.3%

2 47 158

Accuracy  99.7% Matched to Protein or Peptide Standard Only Actual CS Predicted CS 2 3 2/3 Test

2 2/3 3

1121 249 9

Sensitivity, FP rate and Accuracy 2 Sens  81.3% 2 FP rate  1.3% Not assigned CS  20.6% 3 Sens  62.1% 3 FP rate  4.7%

15 96 182

Accuracy  98.2% Logistic 2/3 Test

2 2/3 3

1247 164 7

2 Sens  88.0% 2 FP rate  1.7% Not assigned CS  13.8% 3 Sens  67.6% 3 FP rate  3.5%

22 74 200

Accuracy  98.0% aNote the accuracy value excludes those peptides assigned to 2/3, but these values are included when calculating sensitivity.

timate how many of these random matches may have occurred, the MS/MS runs were also searched against a reversed sequence database. This resulted in 69 random matches to our protein or peptide standards (30 at the +1 charge state, 21 at the +2 charge state, and 18 at the +3 charge state). We assumed that randomly matched peptides would match any of the three charge states with equal probability; thus, two thirds of the +1 matches are actually +2 or +3, and one third of the +2/+3 matches are actually +1. Also, since the +1 test has accuracy close to 1, the actual charge state of these random peptides was identified correctly by the +1 test. Thus, +1 false positives are reduced by 20 (two thirds of 30), and +2/+3 false positives are reduced by 13 (one third of 39). As Table 4 indicates, under these assumptions the accuracy is reduced much less and declines only to 98.5% in the case of the 15% threshold.

Accuracy of the +2/+3 test in the first test dataset is quite similar to that in the training set, as can be seen in Table 5. This is the case for both the complementary matches and the logistic regression model. There is a similar reduction in the percentage of unassigned charge states as well. Allowing lower quality matches again reduces the accuracy of the test, although not to the extent seen for the +1 test. Some of the false positives here are also likely due to random matches to the protein or peptide standards. The percentage of unassigned charge states also increases with lower quality matches, but once again the logistic model leaves fewer charge states unassigned at a similar false positive rate.

Very similar results were obtained by validating the charge state test on the second dataset (six LCQ MS/MS runs; Tables 6 and 7), with the 15% threshold being generally the best. Accuracy on the third test dataset (one LTQ run) was slightly lower, particularly with the 20% threshold (Tables 8 and 9). (This reiterates that thresholds are best chosen specific to the particular instrument, protocol, and pipeline.) The logistic regression model, based on the +2/+3 test, reduces the number of unassigned charge states in both of these situations as well. Finally, this model is relatively skewed towards +2 charge state identifications, indicating that thresholds (and even model parameters) should be calibrated for the particular analysis situation.

TABLE 6. ACCURACY OF THE +1 TEST ON THE SECOND TEST DATASET

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test         Predicted CS   Actual +1   Actual +2/+3   +1 Sens   +1 FP rate   +2/+3 Sens   +2/+3 FP rate   Accuracy
20% GT PC    +1             63          5              100.0%    7.4%         99.5%        0.0%            99.6%
             +2/+3          0           949
15% GT PC    +1             63          4              100.0%    5.9%         99.6%        0.0%            99.7%
             +2/+3          0           950
2 peaks 5%   +1             49          5              77.8%     9.2%         99.5%        1.5%            98.3%
             +2/+3          14          949

Matched to Protein or Peptide Standard Only

Test         Predicted CS   Actual +1   Actual +2/+3   +1 Sens   +1 FP rate   +2/+3 Sens   +2/+3 FP rate   Accuracy
20% GT PC    +1             198         32             98.5%     13.9%        97.4%        0.3%            97.5%
             +2/+3          3           1179
15% GT PC    +1             198         24             98.5%     10.8%        98.0%        0.3%            98.1%
             +2/+3          3           1171
2 peaks 5%   +1             158         17             78.6%     9.7%         98.6%        3.6%            96.4%
             +2/+3          43          1167
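As a check on how the table entries and the random-match correction fit together, the following minimal sketch recomputes sensitivity and FP rate from a 2 x 2 table of counts, and the expected number of misassigned random matches. The definition of FP rate used here (wrong calls divided by all calls of that class) is an assumption inferred from the reported values, not a definition stated in this section, and the function and argument names are purely illustrative.

```python
from fractions import Fraction


def plus1_test_metrics(pred1_act1, pred1_act23, pred23_act1, pred23_act23):
    """Sensitivity and FP rate for the +1 test from a 2 x 2 table of counts.

    Argument names read "predicted class, actual class", e.g. pred1_act23 is
    the number of spectra predicted +1 whose actual charge state is +2 or +3.
    """
    return {
        # Share of actual +1 spectra that were predicted +1.
        "plus1_sens": pred1_act1 / (pred1_act1 + pred23_act1),
        # Assumed definition: share of +1 predictions that are wrong.
        "plus1_fp_rate": pred1_act23 / (pred1_act1 + pred1_act23),
        # Share of actual +2/+3 spectra that were predicted +2/+3.
        "plus23_sens": pred23_act23 / (pred23_act23 + pred1_act23),
        # Assumed definition: share of +2/+3 predictions that are wrong.
        "plus23_fp_rate": pred23_act1 / (pred23_act23 + pred23_act1),
    }


def expected_random_corrections(n_plus1, n_plus2, n_plus3):
    """Expected misassigned random matches when a random match is equally
    likely to fall on any of the three charge states."""
    from_plus1 = Fraction(2, 3) * n_plus1               # +1 matches that are really +2 or +3
    from_plus23 = Fraction(1, 3) * (n_plus2 + n_plus3)  # +2/+3 matches that are really +1
    return from_plus1, from_plus23


# Counts from the 20% GT PC rows of Table 6 (high-confidence matches).
metrics = plus1_test_metrics(63, 5, 0, 949)
# Reversed-database matches reported in the text: 30, 21, and 18.
corrections = expected_random_corrections(30, 21, 18)
```

With the reversed-database counts from the text, the second function returns the 20 and 13 spectra by which the +1 and +2/+3 false positives are reduced.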

Reduction of number of DTA files

In the first test dataset there were 7,536 MS/MS spectra leading to 22,382 DTA files. Using the 15% threshold, 2,278 spectra were assigned to a +1 charge state and the rest to +2, +3, or +2/+3 charge states, leaving 12,678 DTA files (44.5% reduction). Of the remaining 5,258 spectra, 2,573 were assigned to a +2 charge state, 569 were assigned to a +3 charge state, and the rest were left as +2 or +3 charge states, based on the +2/+3 logistic regression model, leaving 8,789 DTA files (further 16.2% reduction). The logistic regression model removed an additional 907 DTA files over the test based on complementary peak matching alone. Altogether, the total reduction in the number of DTA files was 60.7%, and thus the number of spectra that need to be redundantly searched at multiple charge states was reduced by 94.0%. Similar reductions in the number of DTA files to be searched were achieved with the second and third test datasets: 61.5% and 64.3%, respectively (94.8% and 97.3% reductions in redundant searches).

TABLE 7. ACCURACY OF THE +2/+3 TEST ON THE SECOND TEST DATASET (a)

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test                  Predicted CS   Actual +2   Actual +3   +2 Sens   +2 FP rate   +3 Sens   +3 FP rate   Not assigned CS   Accuracy
+2/+3 Test            +2             581         7           77.2%     1.2%         60.7%     2.4%         25.2%             98.6%
                      +2/+3          169         72
                      +3             3           122
Logistic +2/+3 Test   +2             651         7           86.6%     2.0%         49.3%     1.0%         20.8%             98.9%
                      +2/+3          101         95
                      +3             1           99

Matched to Protein or Peptide Standard Only

Test                  Predicted CS   Actual +2   Actual +3   +2 Sens   +2 FP rate   +3 Sens   +3 FP rate   Not assigned CS   Accuracy
+2/+3 Test            +2             704         17          74.4%     2.4%         54.7%     2.4%         25.6%             97.7%
                      +2/+3          236         101
                      +3             5           141
Logistic +2/+3 Test   +2             824         18          87.3%     2.1%         45.2%     0.9%         18.8%             98.1%
                      +2/+3          119         124
                      +3             1           117

(a) Note: the accuracy value excludes those peptides assigned to +2/+3, but these values are included when calculating sensitivity.
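The footnote to Table 7 implies two different denominators: sensitivity counts the spectra left as +2/+3, while accuracy leaves them out. A small sketch of that bookkeeping follows, assuming (as inferred from the reported values, not stated in the text) that FP rate is the share of wrong calls among all calls of a class. The function name and tuple layout are hypothetical.

```python
def two_three_metrics(pred2, pred23, pred3):
    """Metrics for the +2/+3 test.

    Each argument is a pair (actual +2 count, actual +3 count) for spectra
    predicted +2, left undetermined as +2/+3, and predicted +3, respectively.
    """
    actual2 = pred2[0] + pred23[0] + pred3[0]   # all actual +2, undetermined included
    actual3 = pred2[1] + pred23[1] + pred3[1]   # all actual +3, undetermined included
    assigned = sum(pred2) + sum(pred3)          # undetermined spectra excluded
    return {
        "plus2_sens": pred2[0] / actual2,
        "plus2_fp_rate": pred2[1] / sum(pred2),   # assumed definition
        "plus3_sens": pred3[1] / actual3,
        "plus3_fp_rate": pred3[0] / sum(pred3),   # assumed definition
        "not_assigned": sum(pred23) / (actual2 + actual3),
        "accuracy": (pred2[0] + pred3[1]) / assigned,
    }


# Counts for the +2/+3 Test block of Table 7 (high-confidence matches).
m = two_three_metrics(pred2=(581, 7), pred23=(169, 72), pred3=(3, 122))
```

For this block of Table 7 the sketch reproduces the reported sensitivities, FP rates, and accuracy to one decimal place, which supports the reading of the footnote used here.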

TABLE 8. ACCURACY OF THE +1 TEST ON THE THIRD TEST DATASET

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test         Predicted CS   Actual +1   Actual +2/+3   +1 Sens   +1 FP rate   +2/+3 Sens   +2/+3 FP rate   Accuracy
20% GT PC    +1             146         31             98.6%     17.5%        96.5%        0.2%            96.8%
             +2/+3          2           858
15% GT PC    +1             145         10             98.0%     6.4%         98.9%        0.3%            98.7%
             +2/+3          3           879
2 peaks 5%   +1             132         10             89.2%     7.0%         96.5%        1.8%            97.5%
             +2/+3          16          879

Matched to Protein or Peptide Standard Only

Test         Predicted CS   Actual +1   Actual +2/+3   +1 Sens   +1 FP rate   +2/+3 Sens   +2/+3 FP rate   Accuracy
20% GT PC    +1             324         118            92.3%     26.7%        91.1%        2.1%            91.2%
             +2/+3          27          1205
15% GT PC    +1             313         69             89.2%     18.0%        94.8%        2.9%            93.5%
             +2/+3          38          1254
2 peaks 5%   +1             279         41             79.5%     12.8%        96.9%        5.3%            93.2%
             +2/+3          72          1282

Impact of charge state tests on peptide identification

The addition of the +1 test as a predictor to our earlier LIPS model of peptide identification significantly improved the fit of this model (p-value < 0.0001). The further addition of the +2/+3 test improved this LIPS model again (p-value < 0.0001). The predictive value of adding the charge state tests as predictors, or of screening by the tests, relative to our earlier base LIPS model searching at all three charge states can be seen in the ROC curves in Figure 4. They show that the +1 test noticeably improves the predictive ability of the LIPS model, but the further addition of the +2/+3 test offers only a slight improvement in its predictive ability. Screening by the charge state tests does cause some minor loss in sensitivity (loss of some true positives), but only at very high rates of specificity (low false positive rates). For example, if we were to screen matches using LIPS probabilities above 0.9, the results are quite similar: the standard base model results in 1,456 correct peptide identifications with 17 false positives, while screening with CHASTE results in only 12 fewer correct peptide identifications, with 2 fewer false positives as well. Using CHASTE as a predictor resulted in more correct identifications (1,522) at this threshold, but also slightly more false positives (22).
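To put the three sets of counts on a common scale, the short sketch below computes the share of false positives among all accepted identifications for each strategy. This summary quantity is not reported in the text; it is added here only as an illustrative comparison, and the names are hypothetical.

```python
def false_positive_share(correct, false_positives):
    """Fraction of accepted identifications that are false positives."""
    return false_positives / (correct + false_positives)


# (correct identifications, false positives) at a LIPS probability cutoff of 0.9.
strategies = {
    "base LIPS model": (1456, 17),
    "screening with CHASTE": (1456 - 12, 17 - 2),  # 12 fewer correct, 2 fewer false
    "CHASTE as predictor": (1522, 22),
}

shares = {name: false_positive_share(c, fp) for name, (c, fp) in strategies.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

All three shares fall between about 1.0% and 1.4%, consistent with the statement that the results are quite similar.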

TABLE 9. ACCURACY OF THE +2/+3 TEST ON THE THIRD TEST DATASET (a)

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test                  Predicted CS   Actual +2   Actual +3   +2 Sens   +2 FP rate   +3 Sens   +3 FP rate   Not assigned CS   Accuracy
+2/+3 Test            +2             538         2           86.5%     0.4%         67.7%     8.1%         25.4%             98.2%
                      +2/+3          174         52
                      +3             10          113
Logistic +2/+3 Test   +2             672         9           93.1%     1.3%         43.7%     2.6%         15.0%             98.5%
                      +2/+3          48          85
                      +3             2           73

Matched to Protein or Peptide Standard Only

Test                  Predicted CS   Actual +2   Actual +3   +2 Sens   +2 FP rate   +3 Sens   +3 FP rate   Not assigned CS   Accuracy
+2/+3 Test            +2             735         21          68.6%     2.7%         52.8%     13.6%        31.2%             95.4%
                      +2/+3          315         98
                      +3             21          133
Logistic +2/+3 Test   +2             973         53          90.9%     5.2%         33.3%     4.5%         15.3%             94.9%
                      +2/+3          94          115
                      +3             4           84

(a) Note: the accuracy value excludes those peptides assigned to +2/+3, but these values are included when calculating sensitivity.

DISCUSSION

As the results indicate, CHASTE is able to assign charge states with good accuracy to most of the +1, +2, and +3 MS/MS spectra generated by several different ion trap instruments. This reduces by at least 60% the number of spectra that, in our case, SEQUEST is required to search, and the number of redundant searches by over 90%, with little or no reduction in the number of peptides identified or in the accuracy of those identifications.

Identification of +1 charge state spectra can be done with good accuracy using only a simple measure of the percentage of intensity greater than the precursor m/z. The difference in the number of complementary matching fragment ions provides very good discrimination between the +2 and +3 charge states. This cannot always completely separate them with high certainty, so some fraction needs to be left as undetermined. Since there are more possible complementary fragments for a +3 peptide (+1 and +2 fragments versus only +1), adding 5 to the +2 match count as a cosmetic correction makes the distribution of the difference more symmetric around 0. Combining complementary match difference data with the distribution of peak intensity (percentage of intensity above the precursor m/z and above twice the precursor m/z) in a logistic regression model improves discrimination of +2 and +3 charge states and thus reduces the number of undetermined charge states without increasing the error rate.

CHASTE’s ability to discriminate between the charge states decreased somewhat when lower quality peptide matches were included in the validation set. This, however, did not impact peptide identification, since these lower quality matches are not useful for identifying correct peptides. The results were consistent using different LCQ runs from two different labs, but the accuracy was a bit lower for the LTQ test dataset when using the CHASTE model calibrated to our own LCQ training dataset. This reiterates the need to calibrate the models for specific instruments, labs, protocols, and organisms. Running known standards on the specific machine with the specific protocol and database can easily be used to generate data to train and validate the model.
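The +2 versus +3 discrimination described above can be sketched as follows. This is an illustration under stated assumptions, not the CHASTE implementation: complementary fragments are found with the mass-balance relation that the charge-weighted fragment m/z values sum to the charge times the precursor m/z, the tolerance, the sign convention of the difference, the coefficients, and the probability cutoffs are all hypothetical, and every function name is invented for this example.

```python
import math


def count_complementary(mz_values, precursor_mz, charge, tol=0.5):
    """Count fragment pairs whose charge-weighted m/z values add up to
    charge * precursor_mz (mass balance of two complementary fragments)."""
    target = charge * precursor_mz
    n = 0
    if charge == 2:
        # Both fragments singly charged: m1 + m2 = 2 * precursor m/z.
        for i in range(len(mz_values)):
            for j in range(i + 1, len(mz_values)):
                if abs(mz_values[i] + mz_values[j] - target) <= tol:
                    n += 1
    elif charge == 3:
        # One +1 and one +2 fragment: m1 + 2 * m2 = 3 * precursor m/z.
        for i, m1 in enumerate(mz_values):
            for j, m2 in enumerate(mz_values):
                if i != j and abs(m1 + 2 * m2 - target) <= tol:
                    n += 1
    else:
        raise ValueError("charge must be 2 or 3")
    return n


def match_difference(mz_values, precursor_mz, offset=5):
    """(+2 matches + offset) - (+3 matches); the sign convention is assumed."""
    return (count_complementary(mz_values, precursor_mz, 2) + offset
            - count_complementary(mz_values, precursor_mz, 3))


# Hypothetical coefficients for illustration only, not fitted values.
COEF = {"intercept": 0.0, "diff": -0.5, "pct_above_mz": 1.0, "pct_above_2mz": 4.0}


def classify_2_vs_3(diff, pct_above_mz, pct_above_2mz, coef=COEF, low=0.1, high=0.9):
    """Return '+2', '+3', or '+2/+3' (undetermined) from a logistic score."""
    z = (coef["intercept"] + coef["diff"] * diff
         + coef["pct_above_mz"] * pct_above_mz
         + coef["pct_above_2mz"] * pct_above_2mz)
    p_plus3 = 1.0 / (1.0 + math.exp(-z))
    if p_plus3 >= high:
        return "+3"
    if p_plus3 <= low:
        return "+2"
    return "+2/+3"


# Toy peak list: 300 + 700 and 450 + 550 both equal 2 * 500.
peaks_mz = [300.0, 450.0, 550.0, 620.0, 700.0]
diff = match_difference(peaks_mz, precursor_mz=500.0)
label = classify_2_vs_3(diff, pct_above_mz=0.3, pct_above_2mz=0.0)
```

Widening the band between `low` and `high` leaves more spectra undetermined in exchange for fewer wrong calls, which is the trade-off the logistic model is meant to improve.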
FIG. 4. ROC curves for models including charge state (CS) tests as predictors in LIPS models and screening by CS tests compared to the standard base LIPS model. (A) Slight increase in predictive ability of adding CHASTE as predictor and the slight drop in sensitivity only at lower specificity when screening by CHASTE. (B) Enlarged top-left corner (high sensitivity and low specificity) insert of the top figure.

Using high-confidence peptide identifications (such as those with LIPS probabilities > 0.9) can create a gold standard dataset of known charge states to adjust thresholds and estimate parameters in logistic regression models.

CHASTE was trained and validated using MS/MS spectra from the most commonly used ion trap mass spectrometers for high-throughput peptide/protein identification, the LCQ and LTQ, combined with the most commonly used search algorithm, SEQUEST. However, CHASTE is not at all dependent on the database search algorithm, and should apply equally well to other search algorithms and to de novo sequencing approaches. The principles of CHASTE should also apply to any ion trap or triple quadrupole instrument, as long as the algorithm is trained and validated with respect to that instrument, using known standard protein mixtures. Finally, CHASTE was implemented as Java GUI-based and command-line-based interfaces, available free for research purposes.
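Calibrating the +1 threshold on a gold standard set might look like the sketch below. It assumes that a spectrum is called +1 when the fraction of its intensity above the precursor m/z is below the threshold (fragments of a singly charged precursor cannot lie above it), that candidate thresholds are compared by plain accuracy, and that the gold standard is a list of labeled spectra. The toy data and all names are hypothetical.

```python
def fraction_above(peaks, mz_cut):
    """Fraction of total intensity carried by peaks with m/z above mz_cut.

    peaks is a list of (m/z, intensity) pairs.
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    return sum(intensity for mz, intensity in peaks if mz > mz_cut) / total


def is_plus1(peaks, precursor_mz, threshold=0.15):
    """Assumed +1 test: little intensity lies above the precursor m/z."""
    return fraction_above(peaks, precursor_mz) < threshold


def best_threshold(gold, candidates=(0.05, 0.10, 0.15, 0.20)):
    """Pick the candidate threshold with the highest accuracy.

    gold is a list of (fraction above precursor, True if actually +1).
    """
    def accuracy(threshold):
        hits = sum((frac < threshold) == truly_plus1 for frac, truly_plus1 in gold)
        return hits / len(gold)

    return max(candidates, key=accuracy)


# Hypothetical gold standard, e.g. built from high-confidence identifications.
gold_standard = [(0.02, True), (0.12, True), (0.18, False), (0.40, False), (0.60, False)]
chosen = best_threshold(gold_standard)
```

The same loop could score candidates by sensitivity at a fixed FP rate instead of accuracy; which criterion suits a given instrument and pipeline is a choice left to the analyst.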

CONCLUSION

CHASTE provides a simple and accurate approach for estimating charge states of peptide MS/MS spectra from mass spectrometers lacking sufficient resolution to directly determine the charge state from the isotope spacing. The utility of CHASTE was demonstrated using datasets comprised of fully characterized MS/MS spectra and unknown MS/MS spectra from repetitive LC/MS/MS runs of a standard peptide and protein mixture obtained on Thermo Electron’s LCQ and LTQ ion trap mass spectrometers. CHASTE greatly reduces the need to repetitively search MS/MS spectra at different charge states, thereby helping alleviate one of the major bottlenecks in high-throughput peptide and protein identification. CHASTE relies on straightforward measures of fragment ion peak distributions and combines different measures to improve accuracy using reliable statistical models (logistic regression). Thresholds and parameter estimates can be tailored and validated to specific analysis situations by using data gathered on known protein standards.

ACKNOWLEDGMENTS

We greatly appreciate the suggestions and comments of Gerald van Belle, Paul Edlefsen, members of the Shewanella Federation, and especially Jim Fredrickson. We also appreciate the expertise and efforts of the Pacific Northwest National Laboratory’s EMSL Proteomics Facility, specifically Richard Smith, Gordon Anderson, Mary Lipton, Ken Auberry, and Sam Purvine, for compiling the LCQ datasets, and of Thermo Electron Corporation’s (San Jose, CA) Vlad Zabrouskov and Kevin Wheeler for compiling the LTQ dataset. The Department of Energy’s OBER and OASCR Genomics: GTL program under grant DE-FG08-01ER63218 to E.K. supported this research.


Address reprint requests to:
Dr. Eugene Kolker
BIATECH Non-Profit Research Center
19310 N. Creek Pkwy., Ste. 115
Bothell, WA 98011

E-mail: [email protected]
