5847_03_p233-250
9/22/05
1:05 PM
Page 233
OMICS A Journal of Integrative Biology Volume 9, Number 3, 2005 © Mary Ann Liebert, Inc.
Charge State Estimation for Tandem Mass Spectrometry Proteomics *JASON
M. HOGAN, *ROGER HIGDON, NATALI KOLKER, and EUGENE KOLKER
ABSTRACT

High-throughput protein analysis by tandem mass spectrometry produces anywhere from thousands to millions of spectra that are used for peptide and protein identifications. Though each spectrum corresponds to only one charged peptide (ion) state, repetitive database searches of multiple charge states are typically conducted, since the resolution of many common mass spectrometers is not sufficient to determine the charge state. The resulting database searches are both error-prone and time-consuming. We describe a straightforward, accurate approach to charge state estimation (CHASTE). CHASTE relies on fragment ion peak distributions and, by using reliable logistic regression models, combines different measurements to improve its accuracy. CHASTE’s performance has been validated on data sets, composed of known peptide dissociation spectra, obtained by replicate analyses of our earlier developed protein standard mixture using ion trap mass spectrometers at different laboratories. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90%, virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high throughput peptide and protein identifications. Thresholds and parameter estimates can be tailored to specific analysis situations, pipelines, and instrumentations. CHASTE was implemented in Java with GUI-based and command-line-based interfaces.

INTRODUCTION
MASS SPECTROMETRY has recently become the most common approach for high throughput protein sample analysis (Aebersold and Goodlett, 2001; Kolker et al., 2003; Leibler, 2002; Washburn et al., 2001). Complex protein mixtures are typically converted into peptides through enzymatic or chemical cleavage. The most commonly used enzyme for protein digestion is trypsin, as the resulting peptides are often smaller and more amenable to positive ion analysis. The resulting peptides are usually subjected to gel electrophoresis or chromatographic separation to reduce the mixture complexity. The separated peptides are ionized by electrospray (Fenn et al., 1989), nanospray (Van Berkel et al., 2001), or matrix-assisted laser desorption ionization (Karas and Hillenkamp, 1988) and introduced into a mass spectrometer. Peptide analyses have been performed on a variety of instruments, including ion trap, triple quadrupole, quadrupole time-of-flight, tandem time-of-flight, and Fourier transform ion cyclotron resonance instruments. Tandem mass spectrometry (MS/MS) is used to obtain peptide sequence information by interpretation of the resulting fragment ion spectrum.

Peptide and protein identifications are typically accomplished via database searching of uninterpreted peptide MS/MS spectra. The two most commonly used programs are SEQUEST (Eng et al., 1994; MacCoss et al., 2002) and Mascot (Perkins et al., 1999). These algorithms compare the experimental peptide fragmentation spectrum to theoretical peptide product ion spectra created in silico from amino acid sequences contained in a protein sequence database. As protein sequence databases continue to expand, the increasing numbers of protein sequences make it harder for peptide identification software to correctly relate an MS/MS spectrum to a peptide sequence present in the protein database. The use of electrospray or nanospray ionization adds to this challenge because these methods produce multiply charged ions. Multiply charged ions typically provide more peptide sequence information than singly charged ions, allowing better peptide identification, especially when searching for post-translational modifications. The drawback of MS/MS spectra of multiply charged ions is that the fragmentation spectra are more complex. Though high-end instruments can observe the isotopic spacing of precursor and fragment ions to interpret their charge state, most commonly used ion trap and triple quadrupole instruments generally have inadequate resolution to do so. To overcome this limitation, database searches are typically conducted assuming all possible charge states of the precursor ion, and the search results are then interpreted to determine the charge state. This approach increases the overall time of the database search, as well as the possibility of false positive identifications due to the increased number of possible peptide candidates.

BIATECH, Bothell, Washington. *Authors contributed equally to this study and should be considered first authors.
Knowledge of the precursor ion charge state is needed not only by database search algorithms but also by de novo sequencing methods (Dancik et al., 1999). Many proteomics laboratories have implemented straightforward approaches to differentiate 1+ charged spectra from multiply charged ones (Tabb et al., 2001), and a few have implemented approaches to discriminate 2+ and 3+ spectra (Colinge et al., 2003; Dancik et al., 1999; Sadygov et al., 2002). These are largely based on identifying complementary fragment pairs present in the MS/MS spectra under the assumed charge state and comparing the distribution of fragment peaks relative to the precursor mass to charge ratio. However, very little has been published about the accuracy and reliability of such approaches. Recently, a support vector machine classifier was developed to differentiate MS/MS spectra of doubly and triply charged peptides while maintaining almost all true positive peptide identifications (Klammer et al., 2005). The support vector machine utilizes 34 different features to predict the peptide charge state. Whatever approach is used for charge state determination, it is unlikely that it will apply universally to diverse types of protocols, pipelines, and instrumentation. Therefore, there is a clear need for an accurate, generalized, and statistically solid approach that can be tailored and validated for each specific analysis situation. We herein present such an approach for charge state estimation (CHASTE) for peptide spectra with 1+, 2+, or 3+ charge states. CHASTE relies on fragment ion peak distributions and, by using reliable logistic regression models, combines different measures to improve its accuracy. CHASTE’s performance has been validated on data sets obtained on two ion trap mass spectrometers (the LCQ and LTQ from Thermo Electron).
These data sets were obtained at three different labs by analyzing our earlier developed protein standards (Purvine et al., 2004b) with the most commonly used database search program, SEQUEST. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90%, virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high throughput peptide and protein identifications. CHASTE’s thresholds and parameter estimates can be easily tailored to specific analysis situations, pipelines, and instruments. CHASTE was implemented in Java with GUI-based and command-line-based interfaces that are available free of charge for research purposes.
MATERIALS AND METHODS

Creation of data sets

A standard mixture of peptides from stand-alone and protein digest sources was described in our recent study (Purvine et al., 2004b). This mixture was analyzed multiple times via liquid chromatography tandem
mass spectrometry (LC/MS/MS). This mixture is composed of 23 individually characterized tryptic peptides and 12 proteins, all commercially available from Sigma (St. Louis, MO). After LC/MS/MS analysis of the individual protein digests, the proteins were mixed at concentrations described previously (Purvine et al., 2004b), reduced and alkylated, and digested with modified porcine trypsin (V511A, Promega, Madison, WI). These mixtures were combined with two respective mixtures, equimolar and normalized (to yield similar peak intensities), of the 23 peptides. This final mixture, composed of approximately 250 tryptic peptides, was analyzed as described in detail previously (Kolker et al., 2005; Purvine et al., 2004b) at two labs, BIATECH and Pacific Northwest National Laboratory (PNNL), using an LCQ DecaXP Plus ion trap mass spectrometer, and at Thermo Electron using an LTQ ion trap mass spectrometer (Thermo Electron Corp., San Jose, CA). The MS/MS data from the mixture analyses were then analyzed using the SEQUEST Browser (packaged with the Bioworks software tools from Thermo Electron), wherein MS/MS scans were converted to flat data files (DTA files) of assumed 1+, 2+, and 3+ charge states. These DTA files were searched via the SEQUEST algorithm against a database of the 23 peptides and 12 proteins in our standards, known contaminants, and all proteins from Shewanella oneidensis strain MR-1 (Heidelberg et al., 2002; Kolker et al., 2005). The SEQUEST outputs were processed with our earlier developed Logistic Identification of Peptide Sequences (LIPS) models (Higdon et al., 2004). LIPS models fit the logit (log odds) of the probability of correct identification (the LIPS index) as a linear combination of the predictors (McCullagh and Nelder, 1999).
These predictors currently include the cross-correlation score (Xcorr), the relative difference between the first and second highest cross-correlations (ΔCn), peptide length (PL), charge state (CS), and number of tryptic termini (NTT):

$$\mathrm{LIPS} = \beta_0 + \beta_1\,\mathrm{Xcorr} + \beta_2\,\Delta C_n + \beta_3\,\mathrm{PL} + \beta_4\,\mathrm{CS} + \beta_5\,\mathrm{NTT} \qquad (1)$$
where default values for this standard base LIPS model are as follows: β0 = −7.1; β1 = 2.6; β2 = 13.4; β3 = −0.14; β4 = −1.0; and β5 = 2.9. The weights were calculated using logistic regression models that were first trained on one subset of the above protein standards and then validated on another subset, as described in detail in our previous work (Higdon et al., 2004). These LIPS indexes can then be easily converted into estimates of the probability of a correct peptide match:

$$p = \frac{e^{\mathrm{LIPS}}}{1 + e^{\mathrm{LIPS}}} \qquad (2)$$
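The conversion from a LIPS index to a probability in equations (1) and (2) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the weights used in the example are made-up placeholders rather than the published LIPS defaults.

```python
import math


def lips_index(intercept, weights, predictors):
    """Eq. (1): intercept plus the weighted sum of the predictor values."""
    return intercept + sum(weights[name] * predictors[name] for name in weights)


def lips_probability(index):
    """Eq. (2): inverse logit, written to avoid overflow for large |index|."""
    if index >= 0:
        return 1.0 / (1.0 + math.exp(-index))
    z = math.exp(index)
    return z / (1.0 + z)


# Illustrative weights only (assumption); NOT the published LIPS defaults.
weights = {"Xcorr": 2.0, "dCn": 10.0}
index = lips_index(-6.0, weights, {"Xcorr": 2.5, "dCn": 0.3})
probability = lips_probability(index)
```

An index of 0 corresponds to a probability of 0.5, and each unit increase in the index multiplies the odds of a correct match by e.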
Determination of charge state 1+

Theoretically, the MS/MS spectrum of a peptide with a 1+ charge state should contain no fragment ion peaks with an m/z greater than the precursor m/z value. Accordingly, spectra of peptides with charge state 2+ or higher should contain fragment ion peaks above the precursor m/z. Ideally, one should be able to check for the presence of peaks at m/z greater than the precursor to determine the 1+ charge state. However, MS/MS spectra generally contain a number of noise peaks. As part of our previously described SPEQUAL approach (Purvine et al., 2004a), a charge state 1+ test was conducted by filtering out all peaks below a given intensity (SPEQUAL’s default value is 5% of the maximum intensity peak). If the number of remaining peaks at m/z greater than the precursor m/z is less than a given threshold (default value of two peaks), the peptide is assigned the 1+ charge state. Otherwise, the peptide is assumed to be charge state 2+ or higher. In addition to this method, CHASTE also calculates the percentage of all peak intensity that lies at m/z greater than the precursor (regardless of relative intensity).
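The two 1+ tests described above can be sketched as follows. This is a minimal illustration, not the CHASTE source: the function names and the `(m/z, intensity)` tuple representation are assumptions, and the 15% default for the intensity test is one of the thresholds evaluated later in the paper.

```python
def fraction_above_precursor(peaks, precursor_mz):
    """Share of total peak intensity found at m/z above the precursor."""
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    above = sum(intensity for mz, intensity in peaks if mz > precursor_mz)
    return above / total


def is_1plus_by_intensity(peaks, precursor_mz, min_fraction=0.15):
    """Intensity test: call 1+ when too little intensity lies above the precursor."""
    return fraction_above_precursor(peaks, precursor_mz) < min_fraction


def is_1plus_by_peak_count(peaks, precursor_mz, rel_cutoff=0.05, min_peaks=2):
    """Peak-count test: drop peaks below 5% of the base peak, then count
    the survivors above the precursor; fewer than two means 1+."""
    if not peaks:
        return True
    cutoff = rel_cutoff * max(intensity for _, intensity in peaks)
    survivors = sum(1 for mz, intensity in peaks
                    if mz > precursor_mz and intensity >= cutoff)
    return survivors < min_peaks


# Hypothetical spectra as (m/z, intensity) pairs.
singly = [(200.0, 100.0), (300.0, 50.0), (600.0, 2.0)]
multiply = [(200.0, 100.0), (600.0, 80.0), (700.0, 60.0)]
```

For the hypothetical `singly` spectrum with a precursor at m/z 500, both tests report 1+; for `multiply`, both report 2+ or higher.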
Discrimination of charge states 2+ and 3+

Tandem mass spectrometry of multiply charged peptide ions will produce complementary sets of fragment ions, so the total mass and charge of the precursor ion are conserved. Provided both fragment ions are within the mass range of the instrument, complementary sets of singly charged ions will typically be obtained from doubly charged peptide ions. Accordingly, triply charged peptide ions will produce complementary fragment ion sets containing one singly and one doubly charged fragment ion. (The only exception to this rule involves neutral losses, where the total charge of the precursor ion is retained on one of the fragment ions and the neutral fragment is not observed.)

To discriminate between 2+ and 3+ ions (2+/3+ tests), CHASTE implements the following procedure. First, the fragment ion peaks are sorted by abundance, and the top peaks are retained in the fragment list (CHASTE’s default value is 100). Second, any isotopic peaks within a short m/z range surrounding each fragment ion (CHASTE’s default values are 2.5 m/z above and 2.0 m/z below each fragment ion) are removed from this fragment list, starting with the most abundant fragment ion. Third, the subset of the most abundant fragment ion peaks (CHASTE’s default value is 15) is used as the search list.

Next, in the case of spectra assumed to be from doubly charged peptides, each m/z value in this search list is compared to each fragment ion in the fragment list. Such comparisons are done to find the matches that sum to the original singly charged precursor ion mass plus a proton, within a reasonable mass tolerance (CHASTE’s default value is 3 Da). Each matching pair of complementary ions is then counted. (Each m/z value in the search list may only be counted once, and redundant matches are discarded.) Doubly charged fragment ions have no observable complementary ion and will only give rise to random matches.

In contrast, MS/MS spectra of triply charged peptides can produce fragment ions that are singly, doubly, and triply charged. Since the charge state of each m/z value in the search list is unknown, it must be searched as if it were both a singly and a doubly charged fragment ion. This results in two independent searches. After adjusting the m/z values in the search list for the assumed charge state, each m/z value is again compared to each fragment ion in the fragment list, as done for the doubly charged case.
Each m/z value in the search list can only be counted once per search, but an m/z value will be counted twice if it matches a complementary ion that is singly charged and a separate complementary ion that is doubly charged. Triply charged fragment ions, in this case, have no observable complementary ions and will also give rise only to random matches, as in the doubly charged case. Finally, after counting all the matched complementary ion pairs for both the doubly and triply charged cases, the difference between the two values (2+ matches minus 3+ matches) is used as a predictor for determining the precursor peptide charge state.

In addition to the number of matching complementary ion pairs, the distribution of peaks in relation to the precursor m/z also contains information about the precursor charge state. For the same reason as described for the MS/MS spectra of charge state 1+, spectra at charge state 2+ should contain no fragment ion peaks above twice the precursor m/z. Additionally, spectra at charge state 3+ should contain a greater proportion of fragment ion peaks above the precursor m/z. To measure these, CHASTE uses the percentage of all peak intensity at m/z greater than the precursor m/z (as was used to test for charge state 1+) and the percentage of all peak intensity at m/z greater than twice the precursor m/z (regardless of relative intensity). These measures are combined with the difference in the number of complementary fragment ion matches between charge states 2+ and 3+ described above by using logistic regression models (McCullagh and Nelder, 1999), similar to our LIPS model, to create a compound predictor of charge state used in CHASTE (2+/3+ tests).
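The complementary-ion counting procedure can be sketched as below. This is one possible reading of the text, not the CHASTE implementation: the function names are hypothetical, the target sum is taken as the assumed charge times the precursor m/z (which for a 2+ precursor equals the singly charged precursor mass plus a proton), the 3 Da tolerance is applied to that sum, and the handling of redundant pairs is an assumption.

```python
def build_fragment_list(peaks, n_fragments=100, iso_above=2.5, iso_below=2.0):
    """Keep the most abundant peaks, dropping likely isotopic neighbours."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_fragments]
    kept = []
    for mz, _ in ranked:  # most abundant first
        if not any(k - iso_below <= mz <= k + iso_above for k in kept):
            kept.append(mz)
    return kept  # m/z values, still ordered by decreasing abundance


def count_pairs(fragments, target, search_z, partner_z, seen, n_search=15, tol=3.0):
    """Count search-list values that have a complementary partner."""
    count = 0
    for i, s in enumerate(fragments[:n_search]):
        for j, f in enumerate(fragments):
            pair = frozenset(((i, search_z), (j, partner_z)))
            if (i != j and pair not in seen
                    and abs(search_z * s + partner_z * f - target) <= tol):
                seen.add(pair)  # mirror-image (redundant) matches are skipped
                count += 1
                break           # a search value counts once per search
    return count


def match_difference(peaks, precursor_mz, offset=5):
    """Return (2+ matches, 3+ matches, 2+ matches + offset - 3+ matches)."""
    fragments = build_fragment_list(peaks)
    n2 = count_pairs(fragments, 2 * precursor_mz, 1, 1, set())
    seen3 = set()
    n3 = (count_pairs(fragments, 3 * precursor_mz, 1, 2, seen3)
          + count_pairs(fragments, 3 * precursor_mz, 2, 1, seen3))
    return n2, n3, n2 + offset - n3


def assign_by_difference(diff, min_2plus=3, max_3plus=-4):
    """Thresholds on the offset difference; values in between stay ambiguous."""
    if diff >= min_2plus:
        return "2+"
    if diff <= max_3plus:
        return "3+"
    return "2+/3+"


# Hypothetical spectrum as (m/z, intensity) pairs; 301.0 mimics an isotope peak.
peaks = [(300.0, 100), (700.0, 90), (400.0, 80), (600.0, 70), (301.0, 50), (250.0, 10)]
```

The offset of 5 and the decision thresholds of 3 and −4 are the values reported for the training data in the Results section; like the other defaults, they are meant to be recalibrated for a given instrument and pipeline.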
Training dataset

Data from four LC/MS/MS runs of our protein standard mixtures described above were used as a training set for determining the thresholds and for training the CHASTE logistic regression models. To obtain a set of known peptide ion charge states, only MS/MS spectra that were matched to proteins contained in our set of standard peptides and proteins and were determined to have high quality matches (high confidence peptide identifications with LIPS probabilities greater than 0.9) were retained. This results in a training set with very high certainty of correct peptide identifications and thus correct precursor peptide ion charge states. The training set contains a total of 1,250 peptide MS/MS spectra, of which 98 were charge state 1+, 877 were 2+, and 275 were 3+.
Test datasets and validation of the charge state determinations

An additional six LC/MS/MS runs on our protein standard mixture were used as a first test dataset to validate the charge state determinations. This resulted in 7,536 MS/MS spectra leading to 22,382 DTA files at charge states 1+, 2+, and 3+ that were searched by SEQUEST (note that 226 3+ DTA files were not created due to restrictions on the precursor mass). The same criterion as in the previous section was used to generate
a set of peptides with known charge states (high confidence peptide identifications with LIPS probabilities greater than 0.9). Additionally, a second, less stringent criterion was used to increase the size of the test dataset and include lower quality spectra. For this criterion, a peptide was considered correctly assigned if the following conditions were met: (a) the peptide was matched to one of the standard proteins or peptides; (b) its length is at least 5 amino acids; and (c) the peptide cannot be matched to standard proteins or peptides at more than one charge state. Some of these peptides will inevitably be matched to the incorrect charge state due to the usual random matching of incorrect peptides. To adjust for this, the number of random matches to correct proteins or peptides was estimated by searching the spectra against a reversed sequence database (Higdon and Kolker, unpublished data). In addition to the above dataset, two further test datasets, one obtained by PNNL (six LCQ runs) and the other by Thermo Electron (one LTQ run), were analyzed to validate CHASTE’s performance using procedures similar to the ones described above. Estimated charge states based on the 1+ and 2+/3+ tests described above were compared to the known charge states for these datasets. Measures of sensitivity, specificity, overall accuracy, and false positive rate, as defined in Table 1, were calculated for different thresholds and versions of the 1+ and 2+/3+ tests.
Evaluation of predictive value of the charge state determinations

This study included an evaluation of any potential gain or loss in the ability to predict which peptides are correctly identified by SEQUEST and the LIPS standard default model when DTA files are screened out using the 1+ and 2+/3+ tests. Additionally, we examined the tests’ utility as an additional indicator of peptide match quality (peptide identification probability). This was done by incorporating into our LIPS models two variables indicating whether a DTA file would be kept as a result of passing the 1+ test or both the 1+ and 2+/3+ tests. The models were fit on the training dataset (four MS/MS runs) described above. All matches to contaminants, peptides shorter than five amino acids, and peptides with multiple matches to standard proteins at different charge states were excluded. For purposes of training the models, peptide matches were considered correct if they were matched to one of the standard proteins or peptides, and not a Shewanella oneidensis protein. Models were compared to our standard LIPS model (eq. 1). Each of the above indicator variables

TABLE 1. DEFINITIONS OF ACCURACY MEASURES

                       Actual classification ("gold" standard)
Test classification    Positive               Negative               Total
Positive               True positive (tp)     False positive (fp)    N_{p.}
Negative               False negative (fn)    True negative (tn)     N_{n.}
Total                  N_{.p}                 N_{.n}                 N_{..}

$$\text{Accuracy: } Acc = \frac{tp + tn}{N_{..}} \qquad \text{Sensitivity: } Se = \frac{tp}{N_{.p}} \qquad \text{Specificity: } Sp = \frac{tn}{N_{.n}}$$

$$\text{False positive rate: } FPR = \frac{fp}{N_{p.}} \qquad \text{False negative rate: } FNR = \frac{fn}{N_{n.}}$$

Depending on the situation, positive and negative correspond to different outcomes. For peptide identification, positive is a correct match and negative is an incorrect match; for the 1+ test, positive is the 1+ charge state and negative is 2+ or higher; and for the 2+/3+ test, positive is 2+ and negative is 3+. However, the latter assignment is arbitrary, so specificity for 2+ is equivalent to sensitivity for 3+, the false negative rate for 2+ is equivalent to the false positive rate for 3+, and so on.
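The measures in Table 1 can be computed directly from the four cell counts, as in the sketch below (the function name and dictionary keys are illustrative). Note that the false positive and false negative rates are defined here relative to the test-classification row totals, as in Table 1, not relative to the actual-class totals.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures as defined in Table 1."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),          # tp / N.p
        "specificity": tn / (tn + fp),          # tn / N.n
        "false_positive_rate": fp / (tp + fp),  # fp / Np.
        "false_negative_rate": fn / (fn + tn),  # fn / Nn.
    }


# Counts from the 20% row of Table 2, with 1+ as the positive class.
measures = accuracy_measures(tp=97, fp=1, fn=1, tn=1151)
```

With these counts the sketch gives a sensitivity of 97/98 (99.0%), a false positive rate of 1/98 (1.0%), and an accuracy of 1248/1250 (99.8%), in agreement with the values reported for that row.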
TABLE 2. ACCURACY OF THE 1+ TEST ON THE TRAINING DATASET

Counts are given as predicted 1+ / predicted 2+/3+ for each actual charge state (CS).

Test          Actual 1+    Actual 2+/3+    1+ Sens    2+/3+ Sens    Accuracy    1+ FP rate    2+/3+ FP rate
20% GT PC     97 / 1       1 / 1151        99.0%      99.9%         99.8%       1.0%          0.1%
15% GT PC     120 / 2      8 / 1332        98.1%      99.4%         99.3%
2 peaks 5%    118 / 4      10 / 1330       96.7%      99.3%         99.0%
for the charge state tests were added to this model, and the models were also fit using the standard model predictors after removal of DTA files failing the 1+ test and both the 1+ and 2+/3+ tests, respectively. The models were validated against the first test dataset described above. Receiver Operating Characteristic (ROC) curves were generated for each model. An ROC curve is a plot of sensitivity versus (1 minus specificity) for all possible thresholds of fitted probabilities (eq. 2) from the LIPS models (Table 1 defines the accuracy measures used in this study). These curves show the trade-off between finding correct peptide identifications and making false positive identifications. A greater area underneath the curve is indicative of better predictive ability (Pepe, 2003).
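An ROC curve of this kind can be built from fitted probabilities and known outcomes as sketched below. The function names and the example data are illustrative assumptions; the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs over all score thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Tied scores are admitted together and share one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical fitted probabilities and correctness labels (1 = correct match).
scores = [0.9, 0.8, 0.7, 0.6]
auc_perfect = area_under_curve(roc_points(scores, [1, 1, 0, 0]))
auc_mixed = area_under_curve(roc_points(scores, [1, 0, 1, 0]))
```

A model that ranks every correct match above every incorrect one reaches an area of 1.0, while random ranking gives about 0.5.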
RESULTS

Determination of initial thresholds

To differentiate 1+ from 2+/3+ MS/MS spectra, a minimum percentage intensity threshold, based on the percentage of the total ion intensity contained in fragment ion peaks with m/z values greater than the

TABLE 3. ACCURACY OF THE 2+/3+ TEST ON THE TRAINING DATASET^a

Counts are given as predicted 2+ / predicted 2+/3+ (not assigned) / predicted 3+ for each actual charge state (CS).

Test                  Actual 2+        Actual 3+        2+ Sens    3+ Sens    Not assigned CS    2+ FP rate    3+ FP rate    Accuracy
2+/3+ Test            754 / 121 / 1    3 / 103 / 170    86.1%      61.6%      19.4%              0.4%          0.6%          99.5%
Logistic 2+/3+ Test   805 / 79 / 1     3 / 70 / 194     90.1%      72.7%      12.9%              0.4%          0.5%          99.6%

^a Note: the accuracy value excludes those peptides assigned to 2+/3+, but these values are included when calculating sensitivity.
precursor m/z, was utilized. The use of a minimum threshold of 20% resulted in the highest accuracy (99.8%) for determination of 1+ charge state spectra based on the training set data (Table 2). Using a threshold of 15% resulted in slightly worse accuracy (99.6%) but was included because 20% was a surprisingly large threshold, and the test based on the number of peaks (threshold of 2 peaks) above 5% relative intensity resulted in lower accuracy (99.0%).

In the 2+/3+ tests, to make the distribution of the difference in complementary matches between 2+ and 3+ charge state peptides symmetric around 0 (2+ mean and 3+ mean), 5 was added to the number of matches assuming a 2+ charge state. This is necessary since matches at the 3+ charge state can occur with both singly and doubly charged fragments, compared to only singly charged fragments at the 2+ charge state. As seen in Table 3, if peptides with a difference (2+ matches plus 5 minus 3+ matches) greater than or equal to 3 are assigned to charge state 2+ and peptides with a difference less than or equal to −4 are assigned to 3+, the resulting error rates for 2+ and 3+ assignments are less than 1% (0.4% for 2+ and 0.6% for 3+). Peptides with differences in between these thresholds would be considered ambiguous and would be searched at both charge states.

Figures 1–3 demonstrate the three possible scenarios for precursor charge state selection based on the difference in complementary ion matches between the 2+ and 3+ charge states. Figure 1 shows an MS/MS spectrum of a doubly charged peptide with an m/z of 875.72. Assuming the dissociated peptide ion was doubly charged, the total number of matched complementary ions was 9, while assuming the peptide was triply charged, the total number of matched complementary ions was only 7. The difference in complementary ions assigns this MS/MS spectrum to a doubly charged precursor ((9 + 5) − 7 = 7). The most abundant fragment ion in the spectrum was determined to result from the loss of a neutral water molecule, after manual inspection of the mass difference of 9 Da between the fragment ion and the dissociated precursor ion. This fragment ion is not matched to any complementary ion, as the fragment ion is doubly charged. A case demonstrating a triply charged precursor ion with an m/z of 547.98 is shown in Figure 2. This spectrum contains a larger number of matched complementary ions related to a triply charged precursor (14) than to a doubly charged precursor (4). Figure 3 shows the MS/MS spectrum of a precursor ion with an m/z of 432.11. This spectrum demonstrates a case where the difference in complementary fragment ions assigned to a doubly and a triply charged precursor ion falls between the decision thresholds. As a result, both the 2+ and 3+ charge states would be searched using SEQUEST and evaluated by the LIPS model.

FIG. 1. MS/MS spectrum of a doubly charged peptide with an m/z of 875.72. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state is correctly identified as 2+, as the number of complementary ion pairs matching the doubly charged peptide is above the decision threshold.

FIG. 2. MS/MS spectrum of a triply charged peptide with an m/z of 547.98. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state is correctly identified as 3+, as the number of complementary ion pairs matching the triply charged peptide is above the decision threshold.

FIG. 3. MS/MS spectrum of a triply charged peptide with an m/z of 432.11. The matched complementary ions for each possible precursor ion charge state are labeled. Fragment ions involved in multiple complementary ion pairs are only labeled once. The charge state cannot be determined, since the number of complementary ion pairs is within the decision thresholds.

The difference in complementary ion matches was combined with the percentage of fragment ion intensity greater than the precursor m/z and greater than twice the precursor m/z, using a logistic regression model fit to the training data. This approach is similar to the approach described in equations (1) and (2) for the LIPS models. All these predictors significantly improved the fit to the training data after accounting for the others (p-values all less than 0.001). The resulting model is

$$L_{CS} = -8.88 + 14.19\,X_1 + 29.38\,X_2 - 0.99\,X_3 \qquad (3)$$

and

$$p_{CS} = \frac{e^{L_{CS}}}{1 + e^{L_{CS}}} \qquad (4)$$

where X1 = % intensity greater than precursor m/z, X2 = % intensity greater than twice precursor m/z, X3 = ((2+ matches + 5) − 3+ matches), and pCS is the estimated probability of the 3+ charge state. Thresholds were chosen to match the error rates of the difference in complementary matches (pCS above 0.96 is the 3+ charge state, pCS below 0.06 is the 2+ charge state, and otherwise the charge state is not assigned). As seen in Table 3, the logistic model increased sensitivity for 2+ and 3+ identification and thereby reduced the percentage of unassigned charge states from 19.4% to 12.9%. Examples of this model can be seen in Figures 1–3, where the 2+ spectrum (Fig. 1) has a fitted probability pCS near 0, the 3+ spectrum (Fig. 2) has a probability near 1, and the last spectrum (Fig. 3) has an ambiguous pCS value of 0.24.
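A sketch of how equations (3) and (4) and the probability thresholds might be applied is given below. Several things here are assumptions rather than statements from the paper: the signs of the intercept and of the X3 coefficient are taken as negative (so that more 2+ matches lower the probability of 3+), X1 and X2 are entered as fractions between 0 and 1, the function names are hypothetical, and the intensity fractions in the examples are invented. Only the match counts (9 and 7 for Fig. 1, 4 and 14 for Fig. 2) come from the text.

```python
import math


def charge_state_probability(x1, x2, x3, b0=-8.88, b1=14.19, b2=29.38, b3=-0.99):
    """Estimated probability of the 3+ charge state (eqs. 3 and 4).

    x1, x2: fraction of intensity above the precursor m/z and above twice
    the precursor m/z (assumed to be on a 0-1 scale).
    x3: (2+ matches + 5) - 3+ matches.
    The coefficient signs are assumptions.
    """
    logit = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return 1.0 / (1.0 + math.exp(-logit))


def assign_charge_state(p_cs, upper=0.96, lower=0.06):
    """Apply the probability thresholds; in-between values stay unassigned."""
    if p_cs > upper:
        return "3+"
    if p_cs < lower:
        return "2+"
    return "2+/3+"


# X3 from the match counts quoted for Figs. 1 and 2; x1 and x2 are invented.
p_fig1_like = charge_state_probability(0.30, 0.00, (9 + 5) - 7)
p_fig2_like = charge_state_probability(0.60, 0.05, (4 + 5) - 14)
p_ambiguous = charge_state_probability(0.50, 0.00, 0)
```

Spectra left at "2+/3+" would be searched at both charge states, so tightening or loosening the two thresholds trades search savings against misassignment risk.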
Validation of charge state tests Table 4 shows the results of the different 1 tests obtained on our first test dataset. A minimum percentage intensity threshold of 15% of the total ion intensity performed slightly better than using either a 20% minimum intensity threshold or a threshold requiring no more than one fragment ion peak above the precursor m/z to have an intensity greater than 5% relative abundance, when using the same standard criterion as in the training dataset. Lessening the standard criterion for known charge states to include lower quality matches to the known protein or peptide standards appears reduce the accuracy of the 1 tests. Accuracy declines from 99.3% to 96.9% using the 15% threshold. However, part of this decline is due to random peptide matches of known standard proteins or peptides searched for an incorrect charge state. To es241
5847_03_p233-250
9/22/05
1:05 PM
Page 242
TABLE 4.
ACCURACY
OF THE
1 TEST
ON THE
Matched to Protein or Peptide Standard and LIPS Probability 0.9 ActualCS Predicted CS 1 2/3 20% GT PC
1 2/3
120 2
FIRST TEST DATASET Sensitivity, FP rate and Accuracy
1 Sens 98.1% 1 FP rate 9.1% 2/3 Sens 2/3 FP rate 0.2% 99.1%
12 1328
Accuracy 99.0% 15% GT PC
1 2/3
94 4
1 Sens 95.9% 2/3 Sens 99.9%
1 1151
1 FP rate 1.1% 2/3 FP rate 0.3%
Accuracy 99.6% 2 peaks 5%
1 2/3
96 2
1 Sens 98.0% 2/3 Sens 99.1%
10 1142
1 FP rate 9.7% 2/3 FP rate 1.7%
Accuracy 99.0% Matched to Protein or Peptide Standard Only Actual CS Predicted CS 1 2/3 20% GT PC
1 2/3
417 27
Sensitivity, FP rate and Accuracy 1 Sens 93.9% 1 FP rate 9.2% 2/3 Sens 2/3 FP rate 1.6% 97.5%
42 1672
Accuracy 96.8% 15% GT PC
1 2/3
411 33
1 Sens 92.6% 1 FP rate 7.4% 2/3 Sens 2/3 FP rate 1.9% 98.1%
33 1681
Accuracy 96.9% 2 peaks 5%
1 2/3
408 36
1 Sens 91.7% 1 FP rate 8.3% 2/3 Sens 2/3 FP rate 1.5% 98.4%
37 1677
Accuracy 96.6% Matched to Protein or Peptide Standard Adjusted for Random Matches Actual CS Predicted CS 1 2/3 20% GT PC
1 2/3
430 7
Sensitivity, FP rate and Accuracy
1 Sens 98.4% 1 FP rate 6.3% 2/3 Sens 2/3 FP rate 0.4% 98.3%
29 1692
Accuracy 98.3% 15% GT PC
1 2/3
424 13
1 Sens 97.0% 1 FP rate 4.5% 2/3 Sens 2/3 FP rate 0.8% 98.8%
20 1701
Accuracy 98.5% 2 peaks 5%
1 2/3
421 16
1 Sens 96.3% 1 FP rate 5.4% 2/3 Sens 2/3 FP rate 0.9% 98.6%
24 1697
Accuracy 98.1%
242
5847_03_p233-250
9/22/05
1:05 PM
Page 243
CHASTE FOR TANDEM MASS SPECTROMETRY PROTEOMICS TABLE 5.
ACCURACY
OF THE
2/3 TEST
ON THE
Matched to Protein or Peptide Standard and LIPS Probability 0.9 Actual CS Predicted CS 2 3 2/3 Test
2 2/3 3
957 159 5
FIRST TEST DATASETa Sensitivity, FP rate and Accuracy 2 Sens 85.3% 2 FP rate 0.1% Not assigned CS 14.2% 3 Sens 74.3% 3 FP rate 3.1%
1 53 153
Accuracy 99.0% Logistic 2/3 Test
2 2/3 3
1029 102 2
2 Sens 90.8% 2 FP rate 0.2% Not assigned CS 11.1% 3 Sens 76.3% 3 FP rate 1.3%
2 47 158
Accuracy 99.7% Matched to Protein or Peptide Standard Only Actual CS Predicted CS 2 3 2/3 Test
2 2/3 3
1121 249 9
Sensitivity, FP rate and Accuracy 2 Sens 81.3% 2 FP rate 1.3% Not assigned CS 20.6% 3 Sens 62.1% 3 FP rate 4.7%
15 96 182
Accuracy 98.2% Logistic 2/3 Test
2 2/3 3
1247 164 7
2 Sens 88.0% 2 FP rate 1.7% Not assigned CS 13.8% 3 Sens 67.6% 3 FP rate 3.5%
22 74 200
Accuracy 98.0% aNote the accuracy value excludes those peptides assigned to 2/3, but these values are included when calculating sensitivity.
timate how many of these random matches may have occurred, the MS/MS runs were also searched against a reversed sequence database. This resulted in 69 random matches to our protein or peptide standards (30 at the +1 charge state, 21 at the +2 charge state, and 18 at the +3 charge state). We assumed that randomly matched peptides would match any of the three charge states with equal probability; thus, two thirds of the +1 matches are actually +2 or +3, and one third of the +2/+3 matches are actually +1. Also, since the +1 test has accuracy close to 1, the actual charge state of these random peptides can be taken as correctly identified by the +1 test. Thus, +1 false positives are reduced by 20 (two thirds of 30), and +2/+3 false positives are reduced by 13 (one third of 39). As Table 4 indicates, under these assumptions the accuracy is reduced much less and declines only to 98.5% in the case of the 15% threshold.

Accuracy of the +2/+3 test on the first test dataset is quite similar to that on the training set, as can be seen in Table 5. This is the case for both the complementary matches and the logistic regression model. There is a similar reduction in the percentage of unassigned charge states as well. Allowing lower-quality matches again reduces the accuracy of the test, although not to the extent seen for the +1 test. Some of the false positives here are also likely due to random matches to the protein or peptide standards. The percentage of unassigned charge states also increases with lower-quality matches, but once again the logistic model leaves fewer charge states unassigned at a similar false positive rate.

Very similar results were obtained by validating the charge state tests on the second dataset (six LCQ MS/MS runs; Tables 6 and 7), with the 15% threshold being generally the best. Accuracy on the third test dataset (one LTQ run) was slightly lower, particularly with the 20% threshold (Tables 8 and 9). (This reiterates that thresholds are best chosen specific to the particular instrument, protocol, and pipeline.) The logistic regression model, based on the +2/+3 test, reduces the number of unassigned charge states in both of these situations as well. Finally, this model is relatively skewed towards +2 charge state identifications, indicating that thresholds (and even model parameters) should be calibrated for the particular analysis situation.

TABLE 6. ACCURACY OF THE +1 TEST ON THE SECOND TEST DATASET

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test          Predicted CS   Actual +1   Actual +2/+3
20% GT PC     +1                    63              5
              +2/+3                  0            949
              +1 Sens 100.0%; +2/+3 Sens 99.5%; +1 FP rate 7.4%; +2/+3 FP rate 0.0%; Accuracy 99.6%
15% GT PC     +1                    63              4
              +2/+3                  0            950
              +1 Sens 100.0%; +2/+3 Sens 99.6%; +1 FP rate 5.9%; +2/+3 FP rate 0.0%; Accuracy 99.7%
2 peaks 5%    +1                    49              5
              +2/+3                 14            949
              +1 Sens 77.8%; +2/+3 Sens 99.5%; +1 FP rate 9.2%; +2/+3 FP rate 1.5%; Accuracy 98.3%

Matched to Protein or Peptide Standard Only

Test          Predicted CS   Actual +1   Actual +2/+3
20% GT PC     +1                   198             32
              +2/+3                  3           1179
              +1 Sens 98.5%; +2/+3 Sens 97.4%; +1 FP rate 13.9%; +2/+3 FP rate 0.3%; Accuracy 97.5%
15% GT PC     +1                   198             24
              +2/+3                  3           1171
              +1 Sens 98.5%; +2/+3 Sens 98.0%; +1 FP rate 10.8%; +2/+3 FP rate 0.3%; Accuracy 98.1%
2 peaks 5%    +1                   158             17
              +2/+3                 43           1167
              +1 Sens 78.6%; +2/+3 Sens 98.6%; +1 FP rate 9.7%; +2/+3 FP rate 3.6%; Accuracy 96.4%
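The summary statistics reported for the +1 test, and the random-match adjustment described in the text, can be sketched in a few lines of Python. This is an illustrative sketch, not the CHASTE implementation: the function and argument names are hypothetical, and the FP rate is assumed to be the fraction of calls of a given class that are wrong, a definition that reproduces the tabulated values. The example counts are those of the standards-only, 20% GT PC rows of Table 6.

```python
from fractions import Fraction


def plus_one_test_metrics(pred1_act1, pred1_act23, pred23_act1, pred23_act23):
    """Summarize a 2x2 table of predicted versus actual charge state.

    Hypothetical naming: predN_actM is the number of spectra predicted as
    charge class N whose actual class is M ("23" stands for +2/+3).
    """
    total = pred1_act1 + pred1_act23 + pred23_act1 + pred23_act23
    return {
        "sens_1": pred1_act1 / (pred1_act1 + pred23_act1),
        "sens_23": pred23_act23 / (pred23_act23 + pred1_act23),
        # Assumption: FP rate = wrong calls of a class / all calls of that class.
        "fp_rate_1": pred1_act23 / (pred1_act1 + pred1_act23),
        "fp_rate_23": pred23_act1 / (pred23_act1 + pred23_act23),
        "accuracy": (pred1_act1 + pred23_act23) / total,
    }


def random_match_adjustment(n_plus1, n_plus2, n_plus3):
    """False positives attributed to random matches, assuming a random match
    is equally likely to fall at each of the three charge states."""
    fp_plus1 = Fraction(2, 3) * n_plus1                # +1 matches that are really +2 or +3
    fp_plus23 = Fraction(1, 3) * (n_plus2 + n_plus3)   # +2/+3 matches that are really +1
    return round(fp_plus1), round(fp_plus23)


# Standards-only, 20% GT PC counts from Table 6.
metrics = plus_one_test_metrics(198, 32, 3, 1179)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}

# Reversed-database matches reported in the text: 30, 21, and 18.
adjustment = random_match_adjustment(30, 21, 18)
```

Under these assumptions, `percentages` matches the sensitivity, FP rate, and accuracy values printed for that row, and `adjustment` gives the reductions of 20 and 13 false positives quoted in the text.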
TABLE 7. ACCURACY OF THE +2/+3 TEST ON THE SECOND TEST DATASET (a)

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test                  Predicted CS   Actual +2   Actual +3
+2/+3 Test            +2                   581           7
                      +2/+3                169          72
                      +3                     3         122
                      +2 Sens 77.2%; +2 FP rate 1.2%; Not assigned CS 25.2%; +3 Sens 60.7%; +3 FP rate 2.4%; Accuracy 98.6%
Logistic +2/+3 Test   +2                   651           7
                      +2/+3                101          95
                      +3                     1          99
                      +2 Sens 86.6%; +2 FP rate 2.0%; Not assigned CS 20.8%; +3 Sens 49.3%; +3 FP rate 1.0%; Accuracy 98.9%

Matched to Protein or Peptide Standard Only

Test                  Predicted CS   Actual +2   Actual +3
+2/+3 Test            +2                   704          17
                      +2/+3                236         101
                      +3                     5         141
                      +2 Sens 74.4%; +2 FP rate 2.4%; Not assigned CS 25.6%; +3 Sens 54.7%; +3 FP rate 2.4%; Accuracy 97.7%
Logistic +2/+3 Test   +2                   824          18
                      +2/+3                119         124
                      +3                     1         117
                      +2 Sens 87.3%; +2 FP rate 2.1%; Not assigned CS 18.8%; +3 Sens 45.2%; +3 FP rate 0.9%; Accuracy 98.1%

(a) Note: the accuracy value excludes those peptides assigned to +2/+3, but these values are included when calculating sensitivity.

Reduction in the number of DTA files

In the first test dataset there were 7,536 MS/MS spectra leading to 22,382 DTA files. Using the 15% threshold, 2,278 spectra were assigned to a +1 charge state and the rest to +2, +3, or +2/+3 charge states, leaving 12,678 DTA files (44.5% reduction). Of the remaining 5,258 spectra, 2,573 were assigned to a +2 charge state, 569 were assigned to a +3 charge state, and the rest were left as +2 or +3 charge states, based on the +2/+3 logistic regression model, leaving 8,789 DTA files (a further 16.2% reduction). The logistic regression model removed an additional 907 DTA files over the test based on complementary peak matching alone. Altogether, the total reduction in the number of DTA files was 60.7%, and the number of spectra that need to be redundantly searched at multiple charge states was reduced by 94.0%. Similar reductions in the number of DTA files to be searched were achieved with the second and third test datasets: 61.5% and 64.3%, respectively (94.8% and 97.3% reductions in redundant searches).
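The bookkeeping behind such a reduction can be sketched as follows. This is a simplified illustration with hypothetical function names: it assumes one DTA file per spectrum with a definite charge assignment, two per spectrum left as +2/+3, and three per spectrum searched at all charge states. The study's own file totals depend on how the original DTA files were generated, so only the overall percentage is checked against the reported counts; the per-class example uses toy numbers.

```python
def dta_files_needed(n_plus1, n_plus2, n_plus3, n_undetermined, n_untested=0):
    """Number of DTA files to search, under the per-spectrum assumptions above."""
    definite = n_plus1 + n_plus2 + n_plus3     # one file each
    return definite + 2 * n_undetermined + 3 * n_untested


def percent_reduction(files_before, files_after):
    """Percentage of DTA files that no longer need to be searched."""
    return 100.0 * (files_before - files_after) / files_before


# Overall reduction from the reported totals: 22,382 files before, 8,789 after.
overall = round(percent_reduction(22382, 8789), 1)

# Toy example: 20 spectra searched at all three charge states, versus the same
# 20 spectra after 10 are called +1, 5 are called +2, 2 are called +3, and 3
# remain +2/+3.
toy_before = dta_files_needed(0, 0, 0, 0, n_untested=20)
toy_after = dta_files_needed(10, 5, 2, 3)
```

Here `overall` evaluates to 60.7, in agreement with the total reduction quoted above, while the toy example goes from 60 files to 23.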
TABLE 8. ACCURACY OF THE +1 TEST ON THE THIRD TEST DATASET

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test          Predicted CS   Actual +1   Actual +2/+3
20% GT PC     +1                   146             31
              +2/+3                  2            858
              +1 Sens 98.6%; +2/+3 Sens 96.5%; +1 FP rate 17.5%; +2/+3 FP rate 0.2%; Accuracy 96.8%
15% GT PC     +1                   145             10
              +2/+3                  3            879
              +1 Sens 98.0%; +2/+3 Sens 98.9%; +1 FP rate 6.4%; +2/+3 FP rate 0.3%; Accuracy 98.7%
2 peaks 5%    +1                   132             10
              +2/+3                 16            879
              +1 Sens 89.2%; +2/+3 Sens 96.5%; +1 FP rate 7.0%; +2/+3 FP rate 1.8%; Accuracy 97.5%

Matched to Protein or Peptide Standard Only

Test          Predicted CS   Actual +1   Actual +2/+3
20% GT PC     +1                   324            118
              +2/+3                 27           1205
              +1 Sens 92.3%; +2/+3 Sens 91.1%; +1 FP rate 26.7%; +2/+3 FP rate 2.1%; Accuracy 91.2%
15% GT PC     +1                   313             69
              +2/+3                 38           1254
              +1 Sens 89.2%; +2/+3 Sens 94.8%; +1 FP rate 18.0%; +2/+3 FP rate 2.9%; Accuracy 93.5%
2 peaks 5%    +1                   279             41
              +2/+3                 72           1282
              +1 Sens 79.5%; +2/+3 Sens 96.9%; +1 FP rate 12.8%; +2/+3 FP rate 5.3%; Accuracy 93.2%

Impact of charge state tests on peptide identification

The addition of the +1 test as a predictor to our earlier LIPS model of peptide identification significantly improved the fit of this model (p-value < 0.0001). The further addition of the +2/+3 test improved this LIPS model again (p-value < 0.0001). The predictive value of adding the charge state tests as predictors, or of screening by the tests, relative to our earlier base LIPS model searching at all three charge states can be seen in the ROC curves in Figure 4. They show that the +1 test noticeably improves the predictive ability of the LIPS model, but the further addition of the +2/+3 test offers only a slight improvement. Screening by the charge state tests does cause some minor loss in sensitivity (loss of some true positives), and only at very high rates of specificity (low false positive rates). For example, if we were to screen matches using LIPS probabilities above 0.9, the results are quite similar: the standard base model results in 1,456 correct peptide identifications with 17 false positives, while screening with CHASTE results in only 12 fewer correct peptide identifications, with 2 fewer false positives as well. Using CHASTE as a predictor resulted in more correct identifications (1,522) at this threshold, but also slightly more false positives (22).
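To put the three strategies on a common footing, the counts quoted above can be converted into the share of accepted identifications that are false positives. This is a small illustrative calculation only; the dictionary keys and the function name are hypothetical labels, and the counts for screening are derived from the "12 fewer" and "2 fewer" figures in the text.

```python
def false_positive_share(correct, false_positives):
    """Fraction of accepted identifications that are false positives."""
    return false_positives / (correct + false_positives)


# (correct identifications, false positives) at LIPS probability above 0.9.
strategies = {
    "base LIPS model": (1456, 17),
    "screening with CHASTE": (1456 - 12, 17 - 2),
    "CHASTE as predictor": (1522, 22),
}

shares_percent = {
    name: round(100 * false_positive_share(correct, fp), 2)
    for name, (correct, fp) in strategies.items()
}
```

All three shares fall between about 1.0% and 1.4%, which is consistent with the statement that the results are quite similar at this threshold.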
TABLE 9. ACCURACY OF THE +2/+3 TEST ON THE THIRD TEST DATASET (a)

Matched to Protein or Peptide Standard and LIPS Probability > 0.9

Test                  Predicted CS   Actual +2   Actual +3
+2/+3 Test            +2                   538           2
                      +2/+3                174          52
                      +3                    10         113
                      +2 Sens 86.5%; +2 FP rate 0.4%; Not assigned CS 25.4%; +3 Sens 67.7%; +3 FP rate 8.1%; Accuracy 98.2%
Logistic +2/+3 Test   +2                   672           9
                      +2/+3                 48          85
                      +3                     2          73
                      +2 Sens 93.1%; +2 FP rate 1.3%; Not assigned CS 15.0%; +3 Sens 43.7%; +3 FP rate 2.6%; Accuracy 98.5%

Matched to Protein or Peptide Standard Only

Test                  Predicted CS   Actual +2   Actual +3
+2/+3 Test            +2                   735          21
                      +2/+3                315          98
                      +3                    21         133
                      +2 Sens 68.6%; +2 FP rate 2.7%; Not assigned CS 31.2%; +3 Sens 52.8%; +3 FP rate 13.6%; Accuracy 95.4%
Logistic +2/+3 Test   +2                   973          53
                      +2/+3                 94         115
                      +3                     4          84
                      +2 Sens 90.9%; +2 FP rate 5.2%; Not assigned CS 15.3%; +3 Sens 33.3%; +3 FP rate 4.5%; Accuracy 94.9%

(a) Note: the accuracy value excludes those peptides assigned to +2/+3, but these values are included when calculating sensitivity.

DISCUSSION

As the results indicate, CHASTE is able to assign charge states with good accuracy to most of the +1, +2, and +3 MS/MS spectra generated by several different ion trap instruments. This reduces by at least 60% the number of spectra that, in our case, SEQUEST is required to search, and reduces the number of redundant searches by over 90%, with little or no reduction in the number of peptides identified or in the accuracy of those identifications.

Identification of +1 charge state spectra can be done with good accuracy using only a simple measure: the percentage of intensity greater than the precursor m/z. The difference in the number of complementary matching fragment ions provides very good discrimination between the +2 and +3 charge states. This difference cannot always separate them with high certainty, so some fraction needs to be left as undetermined. Since there are more possible complementary fragments for a +3 peptide (+1 and +2 fragments versus only +1), adding 5 to the +2 match count as a cosmetic correction makes the distribution of the difference more symmetric around 0. Combining complementary match difference data with the distribution of peak intensity (percentage of the precursor m/z and twice the precursor m/z) in a logistic regression model improves discrimination of +2 and +3 charge states and thus reduces the number of undetermined charge states without increasing the error rate.

CHASTE’s ability to discriminate between the charge states decreased somewhat when lower-quality peptide matches were included in the validation set. This, however, did not impact peptide identification, since these lower-quality matches are not useful for identifying correct peptides. The results were consistent across different LCQ runs from two different labs, but the accuracy was a bit lower for the LTQ test dataset when using the CHASTE model calibrated to our own LCQ training dataset. This reiterates the need to calibrate the models for specific instruments, labs, protocols, and organisms. Data to train and validate the model can be easily generated by running known standards on the specific machine with the specific protocol and database.
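The two kinds of measurement discussed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the CHASTE code: all function names, the 1.0 m/z tolerance, and the sign convention of the difference are hypothetical. A spectrum is called +1 when the fraction of intensity above the precursor m/z falls below the threshold. Complementary pairs are found from mass balance: for a +2 precursor, two singly charged fragments have m/z values summing to about twice the precursor m/z, and for a +3 precursor, a singly charged fragment plus twice the m/z of a doubly charged fragment sums to about three times the precursor m/z.

```python
def fraction_intensity_above(peaks, mz_cutoff):
    """Fraction of total fragment intensity found at m/z above mz_cutoff.

    peaks is a list of (m/z, intensity) pairs.
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    above = sum(intensity for mz, intensity in peaks if mz > mz_cutoff)
    return above / total


def looks_singly_charged(peaks, precursor_mz, threshold=0.15):
    """Assumed form of the +1 test: little intensity above the precursor m/z."""
    return fraction_intensity_above(peaks, precursor_mz) < threshold


def count_complementary_pairs(peaks, precursor_mz, charge, tol=1.0):
    """Count peak pairs whose masses add up to the precursor for charge 2 or 3."""
    mzs = [mz for mz, _ in peaks]
    count = 0
    for i, a in enumerate(mzs):
        for j, b in enumerate(mzs):
            if i == j:
                continue
            # +2 precursor: two +1 fragments; unordered pairs, so require i < j.
            if charge == 2 and i < j and abs(a + b - 2 * precursor_mz) <= tol:
                count += 1
            # +3 precursor: a is the +1 fragment, b is the +2 fragment.
            elif charge == 3 and abs(a + 2 * b - 3 * precursor_mz) <= tol:
                count += 1
    return count


def match_difference(peaks, precursor_mz, plus2_offset=5):
    """Offset +2 match count minus +3 match count (sign convention assumed)."""
    n2 = count_complementary_pairs(peaks, precursor_mz, 2)
    n3 = count_complementary_pairs(peaks, precursor_mz, 3)
    return (n2 + plus2_offset) - n3


# Synthetic spectrum for a precursor at m/z 500 with two +2-style pairs:
# 250 + 750 and 420 + 580 both sum to 1000.
spectrum = [(250.0, 10.0), (750.0, 20.0), (420.0, 5.0), (580.0, 15.0)]
```

For this synthetic spectrum, 70% of the intensity lies above the precursor m/z, so it is not called +1, and it contains two pairs consistent with a +2 precursor and none consistent with +3. The quadratic pair search is kept for clarity; a sorted two-pointer scan would be the natural choice for real spectra.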
FIG. 4. ROC curves for models including charge state (CS) tests as predictors in LIPS models and screening by CS tests compared to the standard base LIPS model. (A) Slight increase in predictive ability from adding CHASTE as a predictor, and the slight drop in sensitivity, only at high specificity, when screening by CHASTE. (B) Enlarged top-left corner (high sensitivity and high specificity) inset of the top panel.

Using high-confidence peptide identifications (such as those with LIPS probabilities above 0.9), one can create a gold standard dataset of known charge states with which to adjust thresholds and estimate parameters in logistic regression models.

CHASTE was trained and validated using MS/MS spectra from the ion trap mass spectrometers most commonly used for high-throughput peptide/protein identification, the LCQ and LTQ, combined with the most commonly used search algorithm, SEQUEST. However, CHASTE does not depend on the database search algorithm and should apply equally well to other search algorithms and to de novo sequencing approaches. The principles of CHASTE should also apply to any ion trap or triple quadrupole instrument, as long as the algorithm is trained and validated with respect to that instrument, using known standard protein mixtures. Finally, CHASTE was implemented as Java GUI-based and command-line-based interfaces, available free for research purposes.
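Once parameters have been estimated on such a gold-standard set, applying a logistic model with an undetermined zone might look like the sketch below. Everything specific here is hypothetical: the feature names, the coefficients, the intercept, and the 0.1 and 0.9 cut-offs are placeholders chosen for illustration, not fitted values from this study.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_plus3(features, coefficients, intercept=0.0):
    """Modelled probability that a spectrum is +3 rather than +2."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return logistic(z)


def assign_charge(p_plus3, lower=0.1, upper=0.9):
    """Three-way decision: confident calls at the extremes, +2/+3 in between."""
    if p_plus3 >= upper:
        return "+3"
    if p_plus3 <= lower:
        return "+2"
    return "+2/+3"


# Placeholder coefficients; in practice these would be estimated from spectra
# with known charge states on the instrument and protocol of interest.
coefficients = {"match_difference": -0.8, "frac_above_precursor": 4.0}

calls = [
    assign_charge(prob_plus3({"match_difference": 7, "frac_above_precursor": 0.7}, coefficients)),
    assign_charge(prob_plus3({"match_difference": -6, "frac_above_precursor": 0.5}, coefficients)),
    assign_charge(prob_plus3({"match_difference": 1, "frac_above_precursor": 0.2}, coefficients)),
]
```

Widening the band between `lower` and `upper` trades a larger fraction of undetermined spectra for fewer wrong assignments, which is the calibration choice the text recommends making per instrument and pipeline.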
CONCLUSION

CHASTE provides a simple and accurate approach for estimating charge states of peptide MS/MS spectra from mass spectrometers lacking sufficient resolution to determine the charge state directly from the isotope spacing. The utility of CHASTE was demonstrated using datasets composed of fully characterized MS/MS spectra and unknown MS/MS spectra from repetitive LC/MS/MS runs of a standard peptide and protein mixture obtained on Thermo Electron’s LCQ and LTQ ion trap mass spectrometers. CHASTE greatly reduces the need to repetitively search MS/MS spectra at different charge states, thereby helping alleviate one of the major bottlenecks in high-throughput peptide and protein identification. CHASTE relies on straightforward measures of fragment ion peak distributions and combines different measures to improve accuracy using reliable statistical models (logistic regression). Thresholds and parameter estimates can be tailored to, and validated for, specific analysis situations by using data gathered on known protein standards.
ACKNOWLEDGMENTS

We greatly appreciate the suggestions and comments of Gerald van Belle, Paul Edlefsen, members of the Shewanella Federation, and especially Jim Fredrickson. We also appreciate the expertise and efforts of the Pacific Northwest National Laboratory’s EMSL Proteomics Facility, and specifically Richard Smith, Gordon Anderson, Mary Lipton, Ken Auberry, and Sam Purvine, in compiling the LCQ datasets, and of Thermo Electron Corporation’s (San Jose, CA) Vlad Zabrouskov and Kevin Wheeler in compiling the LTQ dataset. This research was supported by the Department of Energy’s OBER and OASCR Genomics: GTL program under grant DE-FG08-01ER63218 to E.K.
REFERENCES

AEBERSOLD, R., and GOODLETT, D.R. (2001). Mass spectrometry in proteomics. Chem Rev 101, 269–295.
COLINGE, J., MAGNIN, J., DESSINGY, T., et al. (2003). Improved peptide charge state assignment. Proteomics 3, 1434–1440.
DANCIK, V., ADDONA, T.A., CLAUSER, K.R., et al. (1999). De novo peptide sequencing via tandem mass spectrometry. J Comput Biol 6, 327–342.
ENG, J.K., MCCORMACK, A.L., and YATES, J.R. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5, 976–989.
FENN, J.B., MANN, M., MENG, C.K., et al. (1989). Electrospray ionization for mass spectrometry of large biomolecules. Science 246, 64–71.
HEIDELBERG, J.F., PAULSEN, I.T., NELSON, K.E., et al. (2002). Genome sequence of the dissimilatory metal ion-reducing bacterium Shewanella oneidensis. Nat Biotechnol 20, 1118–1123.
HIGDON, R., KOLKER, N., PICONE, A., et al. (2004). LIP index for peptide classification using MS/MS and SEQUEST search via logistic regression. OMICS 8, 357–369.
KARAS, M., and HILLENKAMP, F. (1988). Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem 60, 2299–2301.
KLAMMER, A.A., WU, C.W., MACCOSS, M.J., and NOBLE, W.S. (2005). Peptide charge state determination for low-resolution tandem mass spectra. Proceedings of the Computational Systems Bioinformatics Conference, August 8–11, 2005, Stanford, CA. pp. 175–185.
KOLKER, E., PURVINE, S., GALPERIN, M.Y., et al. (2003). Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. J Bacteriol 185, 4593–4602.
KOLKER, E., PICONE, A.F., GALPERIN, M.Y., et al. (2005). Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations. Proc Natl Acad Sci USA 102, 2099–2104.
LEIBLER, D.C. (2002). Introduction to Proteomics (Humana Press, Totowa, NJ).
MACCOSS, M.J., WU, C.C., and YATES, J.R., 3RD (2002). Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal Chem 74, 5593–5599.
MCCULLAGH, P., and NELDER, J.A. (1999). Generalized Linear Models (Chapman Hall, London).
PEPE, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. (Oxford University Press, Oxford).
PERKINS, D.N., PAPPIN, D.J., CREASY, D.M., et al. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567.
PURVINE, S., KOLKER, N., and KOLKER, E. (2004a). Spectral quality assessment for high throughput tandem mass spectrometry proteomics. OMICS 8, 255–265.
PURVINE, S., PICONE, A.F., and KOLKER, E. (2004b). Standard mixtures for proteome studies. OMICS 8, 79–92.
SADYGOV, R.G., ENG, J., DURR, E., et al. (2002). Code developments to improve the efficiency of automated MS/MS spectra interpretation. J Proteome Res 1, 211–215.
TABB, D.L., ENG, J.K., and YATES, J.R., 3RD (2001). Mass spectrometry. In Proteome Research (Springer, New York), pp. 125–142.
VAN BERKEL, G.J., ASANO, K.G., and SCHNIER, P.D. (2001). Electrochemical processes in a wire-in-a-capillary bulk-loaded, nano-electrospray emitter. J Am Soc Mass Spectrom 12, 853–862.
WASHBURN, M.P., WOLTERS, D., and YATES, J.R., 3RD (2001). Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 19, 242–247.
Address reprint requests to:
Dr. Eugene Kolker
BIATECH Non-Profit Research Center
19310 N. Creek Pkwy., Ste. 115
Bothell, WA 98011
E-mail: [email protected]