
ARTICLE pubs.acs.org/jpr

MassWiz: A Novel Scoring Algorithm with Target-Decoy Based Analysis Pipeline for Tandem Mass Spectrometry

Amit Kumar Yadav, Dhirendra Kumar, and Debasis Dash*

Institute of Genomics and Integrative Biology (CSIR), Mall Road, Delhi, India

ABSTRACT: Mass spectrometry has advanced rapidly in recent years and has become the preferred method for proteomics. Although open source algorithms for peptide identification exist, such as X!Tandem and OMSSA, the field has largely been dominated by proprietary software. There is a need for better, freely available, and configurable algorithms that identify the correct peptides while keeping false positives to a minimum. We have developed MassWiz, a novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type. We tested MassWiz accuracy on 486,882 spectra from a standard mixture of 18 proteins generated on 6 different instruments, downloaded from the Seattle Proteome Center public repository. We compared the MassWiz algorithm with Mascot, Sequest, OMSSA, and X!Tandem at 1% FDR. MassWiz outperformed all of them on the largest data set (AGILENT XCT) and was second only to Mascot in most of the other data sets. MassWiz showed good performance in the analysis of high-confidence peptides, i.e., those identified by at least three algorithms. We also analyzed a yeast data set containing 106,133 spectra downloaded from the NCBI Peptidome repository and obtained similar results. The results demonstrate that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. MassWiz is open-source, versatile, and easily configurable.

KEYWORDS: Tandem mass spectrometry, proteomics, peptide identification, bioinformatics, open source, algorithm, FDR, MS/MS

INTRODUCTION

With the advent of soft ionization techniques like MALDI1 and ESI,2 it became possible to ionize highly polar and nonvolatile molecules such as peptides without destroying them. They could now be introduced into a mass spectrometer, making analysis of peptides considerably easier. Sequence database searching emerged as a valuable alternative to de novo sequencing. Owing to rapid advances in MS instrumentation (LTQ, QTOF, FTICR, Orbitrap, etc.), the availability of complete genome sequences, increased computational power for data analysis, and the development of algorithms, mass spectrometry has become the method of choice for proteomics studies.3,4 Washburn et al.5 demonstrated the high-throughput capability of the LC-MS approach on the yeast proteome, establishing shotgun proteomics as a valuable methodology. There have been improvements in bioinformatics tools and algorithms for signal processing and peak detection,6–8 charge state deconvolution,9,10 noise removal8,11 and spectra filtering,11,12 database searches, and assigning statistical confidence.13–16 Because data analysis involves many complex steps, no single method can be a complete solution.17 There is considerable scope for newer bioinformatics methods and algorithms, especially those freely available in the public domain, for rapid advancement of the field. Tools such as the k-score plugin18 for X!Tandem, the Trans-Proteomics Pipeline (TPP),19 and InsPecT20 are some excellent examples.

A robust scoring function is the heart of any peptide identification algorithm. Scoring functions can be broadly divided into probabilistic and empirical scoring schemes. Mascot21 is one of the most widely used probability-based algorithms, whereas SEQUEST22 is based on cross-correlation between the theoretical and experimental spectra. X!Tandem23 uses a hypergeometric model, and OMSSA24 relies on a Poisson distribution to assess the significance of matches. While all algorithms have their inherent pros and cons, no single method can capture all of the information content from an MS experiment.25 It is generally agreed that using multiple algorithms increases the number of assignments.17,26

We present a novel empirical scoring algorithm that aims to maximize the identifications while keeping the false positives (incorrect identifications) to a minimum. Our scoring function assigns different weights to key ions, their consecutive occurrence, their intensities, and their supporting ions. The significance of intensity as a parameter has been shown previously;27,28 it helps discriminate between a correct and a random match. For developing and testing the scoring function, we needed an easily modifiable framework, so we developed the required framework in Perl, which was easy to implement and modify. Although it may not match the existing algorithms in run time, it remains useful because the Perl code can easily be modified to tweak the algorithm. We benchmarked MassWiz accuracy against Mascot, Sequest, X!Tandem, and OMSSA by comparing the number of identified high-confidence peptides from a standard mixture of 18 proteins.

Decoy methods29–31 have become popular for estimation of false discovery rates (FDR). Although Moore et al.32 first used the method by simply reversing the target database, many alternatives have since been suggested.33,34 The approach is now used for assigning significance to peptide identifications at a fixed FDR value.31 We have integrated the reverse database decoy strategy for significance assessment, which is free from distribution assumptions and does not require curve-fitting. The MassWiz executable is available on SourceForge (http://sourceforge.net/projects/masswiz), and the source code is available freely for academic use on request.

Received: July 19, 2010
Published: March 18, 2011

© 2011 American Chemical Society
dx.doi.org/10.1021/pr200031z | J. Proteome Res. 2011, 10, 2154–2160

MATERIALS AND METHODS

Data Set

A data set of a standard mixture of 18 proteins, the “ISB standard protein mix” described by Klimek et al.,35 was used for validating MassWiz and comparing its accuracy against other algorithms. The Mix 3 data set for all six instruments was downloaded from http://regis-web.systemsbiology.net/PublicDatasets in mzXML format. The FASTA database (18 protein mix, contaminants, and Haemophilus influenzae sequences) was also downloaded and updated with recent sequences for all standard proteins and their homologues. For testing MassWiz on a biological data set, we downloaded yeast mid-log phase data from the NCBI Peptidome repository (http://www.ncbi.nlm.nih.gov/peptidome/psm1001). The FASTA database was downloaded from Swissprot using the taxonomy filter Saccharomyces cerevisiae (Baker’s yeast) [4932]; the complete proteome contained 6616 sequences.

Input Data Preparation

All mzXML files were converted to Mascot generic format (mgf) using the MzXML2search executable from TPP. For each instrument, all mgf files thus obtained were concatenated and used as a common search input to all algorithms.

Algorithm Implementation

The scoring algorithm was tested by developing a framework in Perl (version 5.10.1). The mass calculations and theoretical spectrum generation were accomplished using the InSilicoSpectro package.36 The MassWiz framework includes a complete pipeline, from handling the input spectra to generating FDR-corrected peptide spectrum matches (PSMs), i.e., the top-ranked peptide for each spectrum.

Spectral Processing

Any peptide identification algorithm is only as good as the quality of data it receives. Spectral quality is of great importance for any algorithm to perform at its optimal level. Several studies have been dedicated to spectral quality assessment12,37,38 to obtain better results from search algorithms. Most algorithms have an inbuilt filtering mechanism to remove noise peaks and bad spectra from the input raw data. We have employed a simple yet effective filter to perform this task. A spectrum is dynamically divided into mass bins based on its precursor mass, and at most the five most intense peaks are picked from every bin to obtain better peak coverage from all parts of the spectrum. A minimum intensity threshold can be set for a peak to be considered as signal. Peaks below this threshold are considered noise and deleted before the search. Similarly, a minimum number of peaks can be defined for a spectrum to be considered for search. This reduces random matches and saves time, thus improving the sensitivity and efficiency of the algorithm. Not much is known about the peak filtering step of Mascot. Sequest’s cross-correlation takes care of the spectrum quality. OMSSA applies an intensity threshold cutoff, and X!Tandem uses a maximum of 50 peaks for search by default. The peak intensity filters were not used in this study, so that all algorithms could be compared on complete data, irrespective of spectral quality.

MS/MS Database Search
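As an illustration of the peak filter described under Spectral Processing, here is a minimal Python sketch. It is not the MassWiz Perl code. The paper does not say how the number of bins is derived from the precursor mass, so the 100 Da `bin_width`, the function name `filter_spectrum`, and the peak layout (a list of (m/z, intensity) pairs) are all assumptions made for this example.

```python
import math
from collections import defaultdict


def filter_spectrum(peaks, precursor_mass, bin_width=100.0,
                    peaks_per_bin=5, min_intensity=0.0, min_peaks=5):
    """Keep the most intense peaks per mass bin; return None if too few remain.

    bin_width is a hypothetical choice: the paper only says that the bins
    are derived dynamically from the precursor mass.
    """
    n_bins = max(1, math.ceil(precursor_mass / bin_width))
    width = precursor_mass / n_bins
    bins = defaultdict(list)
    for mz, intensity in peaks:
        if intensity < min_intensity:
            continue  # peaks below the threshold are treated as noise
        index = min(int(mz // width), n_bins - 1)
        bins[index].append((mz, intensity))
    kept = []
    for members in bins.values():
        # retain at most `peaks_per_bin` of the most intense peaks in the bin
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:peaks_per_bin])
    kept.sort()  # restore m/z order
    # spectra with too few peaks are not searched
    return kept if len(kept) >= min_peaks else None
```

With `min_intensity=0.0`, as in the searches reported here, no peak is discarded for being too weak; only the per-bin cap and the minimum peak count apply.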

The mgf files were searched using the updated database and its reversed database for target-decoy based FDR calculation. The search parameters were matched as closely as possible to those described in the original paper,35 and defaults were taken where this was not possible. Searches were performed with a precursor ion tolerance of 3 Da, a product ion tolerance of 1 Da, trypsin digestion with 1 missed cleavage, a fixed modification of +57.03 Da (carbamidomethylation) at cysteine residues, maximum charge +7, minimum 5 peaks, and the peak intensity threshold set to zero. For the yeast data set from ESI-TRAP, a 3 Da error window was allowed for precursors, while fragment masses were matched at 0.6 Da. Tryptic digestion with 1 missed cleavage was considered, with carbamidomethylation as the fixed modification and oxidation of methionine residues as a variable modification. The other parameters were the same as above.

Target-decoy searches and FDR calculation are integrated into the MassWiz framework. Once a search is complete, the target, decoy, and FDR-corrected files are produced as output. Mascot searches were run on a locally installed Mascot server, version 2.2.04. The target and decoy results were exported as csv without any p-value filters for all PSMs, and FDR was calculated. Sequest searches and result extraction were conducted using Thermo’s Proteome Discoverer 1.1 interface. All rank 1 PSMs were exported to Excel sheets for FDR calculation. X!Tandem (TORNADO) results were parsed from the XML files using a Perl program. From these files, FDR was calculated and FDR-corrected PSMs were written to an output file using another Perl program. OMSSA (2.1.9) results were obtained as csv files, from which FDR was calculated and output files were written using a Perl program.
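The reversed database used for the target-decoy searches can be produced with a few lines of code. The sketch below is in Python rather than the Perl used by the authors, and the function names and the `REV_` header prefix are hypothetical choices for this example; the paper does not describe how decoy entries were labelled.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_fasta(records, prefix="REV_"):
    """Yield decoy records by reversing each target protein sequence.

    The prefix is an assumed labelling convention so that decoy hits can be
    told apart from target hits after the search.
    """
    for header, sequence in records:
        yield prefix + header, sequence[::-1]
```

Searching the target and reversed databases with identical parameters gives two score lists per engine, which is all the FDR calculation in the next section needs.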

False Discovery Rate Calculation

The false discovery rate was calculated using Kall’s method.31 Decoy peptides that had identical counterparts in the target database were excluded from the decoy results during FDR calculation. Leu and Ile were considered indistinguishable and treated as identical. FDR was calculated from database search scores wherever possible.

$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs above threshold}}{\text{no. of target PSMs above threshold}}$$

The target and decoy scores were sorted in descending order, and FDR was calculated at each decoy score taken as the threshold. The score at which the FDR was calculated to be 1% or immediately below 1% was taken as the score threshold for 1% FDR. For X!Tandem and OMSSA, the e-values were sorted in


Table 1. Scoring Matrix

                      MALDI                                    ESI
ion type    default   TOF/TOF   TOF PSD   QIT-TOF   QUAD-TOF   QUAD-TOF   TRAP   QUAD   FTICR   4-SECTOR
y [a]       100       100       100       100       100        100        100    100    100     100
b [b]       100       100       100       100       100        100        100    100    100     100
a [c]       50        50        50        50        -          -          -      -      -       50
z           -         -         -         -         -          -          -      -      -       50
immonium    -         100       100       100       100        -          -      -      -       100
y-NH3       25        25        -         25        25         25         25     25     25      -
b-NH3       25        25        25        25        25         25         25     25     25      25
a-NH3       25        25        25        25
y-H2O       -         25        -         25        25         25         25     25     25      -
b-H2O       -         25        25        25        25         25         25     25     25      25
a-H2O       -         25        25        25        -          -          -      -      -       -

[a] A bonus score of 50 is awarded for y-ion continuity, and a score of 50 is deducted for discontinuous y-ions.
[b] A bonus score of 20 is awarded for b-ion continuity, and a score of 20 is deducted for discontinuous b-ions.
[c] No score for a-ion continuity/discontinuity, so the value of Cij for a-ions in eq 2 will be zero.
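To show how the weights in Table 1 enter the score, the following Python sketch evaluates the primary score S(P) (eq 2) and the final score (eq 1) defined under Scoring Function. It is an illustrative reading of those equations, not the MassWiz Perl code. Assumptions: only the y and b series of the ESI TRAP column are included; each matched peak (parent ion or neutral loss) is damped by its own absolute mass error; the discontinuity penalty is applied undamped because an unmatched fragment has no mass error; and the dictionary layout and function names are hypothetical.

```python
import math

# Weights read from the ESI TRAP column of Table 1 ("-" entries are omitted).
# "continuity" is the bonus/penalty from the table footnotes (y: 50, b: 20).
WEIGHTS = {
    "y": {"ion": 100, "nh3": 25, "h2o": 25, "continuity": 50},
    "b": {"ion": 100, "nh3": 25, "h2o": 25, "continuity": 20},
}


def primary_score(series_matches, immonium_errors=(), immonium_weight=0):
    """S(P) of eq 2.

    series_matches maps an ion series ("y" or "b") to a list with one dict
    per theoretical fragment, holding the mass error in Da under the keys
    "ion", "nh3" and "h2o" (missing or None means the peak was not matched).
    """
    total = 0.0
    for series, fragments in series_matches.items():
        w = WEIGHTS[series]
        previous_matched = False
        for frag in fragments:
            err = frag.get("ion")
            if err is not None:
                bonus = w["continuity"] if previous_matched else 0
                total += (w["ion"] + bonus) / math.exp(abs(err))
                # neutral losses only count when the parent ion matched
                for loss in ("nh3", "h2o"):
                    loss_err = frag.get(loss)
                    if loss_err is not None:
                        total += w[loss] / math.exp(abs(loss_err))
            elif previous_matched:
                total -= w["continuity"]  # series broken at this fragment
            previous_matched = err is not None
    for err in immonium_errors:
        total += immonium_weight / math.exp(abs(err))
    return total


def final_score(primary, matched_intensities, all_intensities):
    """score(P) of eq 1: S(P) scaled by the square root of the matched
    fraction of the ion current."""
    return primary * math.sqrt(sum(matched_intensities) / sum(all_intensities))
```

For example, a y-series with fragments 1, 2, and 4 matched at zero mass error, plus an ammonia loss on fragment 2, gives 100 + (100 + 50) + 25 - 50 + 100 = 325 before intensity scaling.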

Table 2. Spectra and Peptides Assigned by the Five Algorithms in a Standard Mixture of 18 Proteins and in a Complex Mid-log Phase Yeast Data Set

                                   MassWiz            Mascot             Sequest            OMSSA              X!Tandem
instrument     spectra searched    spectra  peptides  spectra  peptides  spectra  peptides  spectra  peptides  spectra  peptides

(A) Protein Mixture
AGILENT XCT    244,174             12074    386       10511    343       11429    357       10516    344       5218     303
LCQ_Deca       50,986              3522     372       3661     382       3164     327       3439     357       2142     283
LTQ            79,762              6114     500       6240     504       4243     347       5977     469       3598     323
LTQ-FT         79,372              19616    503       20778    539       15052    396       18212    472       11282    374
QTOF           26,019              3134     237       3500     283       2709     207       2976     244       2390     250
ABI-4700       6,569               1193     253       1260     263       1236     259       1249     262       1148     237
TOTAL          486,882             45,653   2251      45,950   2314      37,833   1893      42,369   2148      25,778   1770

(B) Yeast
               106,133             7782     877       7917     917       6004     727       9019     988       5031     646

ascending order and FDR was calculated as

$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs below e-value threshold}}{\text{no. of target PSMs below e-value threshold}}$$
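The target-decoy threshold search described under False Discovery Rate Calculation can be sketched in Python as follows. This is a simplified illustration, not the authors' Perl programs. Assumptions: "above threshold" is taken to include the threshold itself, the lowest decoy score that still satisfies the FDR limit is returned, and the function names and the (peptide, score) tuple layout are hypothetical.

```python
def drop_shared_decoys(decoy_psms, target_peptides):
    """Discard decoy PSMs whose peptide also occurs in the target database.

    Leu and Ile are treated as identical, as in the paper.
    decoy_psms is a list of (peptide, score) tuples (assumed layout).
    """
    def canonical(peptide):
        return peptide.replace("I", "L")

    shared = {canonical(p) for p in target_peptides}
    return [(pep, score) for pep, score in decoy_psms
            if canonical(pep) not in shared]


def fdr_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest decoy score whose FDR is at or below max_fdr.

    FDR is evaluated at each decoy score taken as the threshold, as
    decoys >= threshold divided by targets >= threshold. Returns None if
    no decoy score satisfies the limit.
    """
    best = None
    for threshold in sorted(set(decoy_scores), reverse=True):
        n_decoy = sum(1 for s in decoy_scores if s >= threshold)
        n_target = sum(1 for s in target_scores if s >= threshold)
        if n_target and n_decoy / n_target <= max_fdr:
            best = threshold
    return best
```

For e-value based engines such as X!Tandem and OMSSA, where lower values are better, the same function can be reused by passing negated e-values, which turns "above threshold" into "below e-value threshold".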

Comparison of Algorithms

All algorithms were compared after FDR calculation. A Perl program was written to compare the peptides assigned by the five algorithms.

RESULTS AND DISCUSSION

Scoring Function

The most important aspect of a mass spectrometry based peptide identification algorithm is a robust scoring function. Variability in fragmentation patterns,39–41 in the extent of fragmentation, and in peak intensities42,43 across runs, instruments, and methodologies makes this task challenging. We have developed a novel empirical scoring scheme based on the knowledge of ion abundances and their intensities. CID fragmentation patterns have been studied in extensive detail.42,44–47 On the basis of knowledge gained from the literature, we experimented with several combinations of scores for the ions based on their known abundances and supporting ions. We arrived at empirical weights for the different ion types depending on their presence in a particular instrument type (Table 1). For matching a spectrum against a candidate peptide P, the score of the peptide is calculated as

$$\mathrm{score}(P) = S(P) \times \sqrt{\frac{\sum_{i=1}^{k} I_i}{\sum_{i=1}^{n} I_i}} \qquad \text{(eq 1)}$$

where
P = candidate peptide
score(P) = final score for the candidate peptide against the experimental spectrum
S(P) = primary score for peptide P (described in detail in eq 2)


Figure 1. Comparison of number of spectra identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.

Figure 2. Comparison of number of peptides identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.

Figure 3. Comparison of number of (A) spectra and (B) peptides identified by MassWiz, Mascot, Sequest, and OMSSA for mid-log phase yeast data set at 1% FDR.

k = number of peaks matched
$I_i$ = intensity of the ith peak
n = number of peaks in the experimental spectrum (after processing)

The term under the square root signifies the matched ion current, expressed as a fraction of the total ion current of the processed spectrum. It was square-root transformed to decrease the effect of intensity irregularities caused by a variable fragmentation pattern, and this was found to perform better than log transformation. Inclusion of the intensity factor in our scoring function increases the resolution of correct assignments over random matches. The fragment mass errors can be very helpful in discerning good matches and have been implemented using an exponential function. In simpler words, the lower the mass error, the better the score for a fragment ion match.

$$S(P) = \sum_{i \in \{y,b,a\}} \sum_{j=1}^{n} \left( \frac{X_{ij} + C_{ij}}{e^{|\Delta m_{ij}|}} + \frac{N_{ij}}{e^{|\Delta m_{ij}|}} + \frac{W_{ij}}{e^{|\Delta m_{ij}|}} \right) + \sum_{j=1}^{k} \frac{Q_j}{e^{|\Delta m_j|}} \qquad \text{(eq 2)}$$

where
n = total peaks in the theoretical spectrum for a given ion series (y/b/a type ion)
for a given i ∈ y/b/a ion series:
$X_{ij}$ = score for the jth peak matched
$C_{ij}$ = bonus score for the continuity factor when the j and j − 1 peaks are both matched, and a negative score for a discontinuous ion series, i.e., when the j − 1 peak matches but j does not
$N_{ij}$ = score for the jth matched peak for neutral loss of ammonia (NH3) when $X_{ij} \neq 0$
$W_{ij}$ = score for the jth matched peak for neutral loss of water (H2O) when $X_{ij} \neq 0$
$\Delta m$ = mass difference for the matched fragment peak, i.e., $M_{\text{experimental}} - M_{\text{theoretical}}$
k = total peaks in the theoretical spectrum for the immonium ion series
$Q_j$ = score for the jth matched peak for an immonium ion

These empirical scores are taken from the scoring matrix given in Table 1. The scoring function is adapted to the irregularities of instrument types, as it makes extensive use of the information content present in the spectrum along with y- and b-ions. The complementary ions such as neutral losses and immonium ions (depending on the instrument type) can help differentiate between a correct and an incorrect hit when the b-y counts are very close together. Also, the continuity of a series (b/y/a) greatly increases the confidence in the matched ions even when fragmentation is not complete due to peptides containing partially mobile or nonmobile protons.

We compared MassWiz with four widely used algorithms: Mascot, Sequest, X!Tandem, and OMSSA. Six different data sets from the ISB standard mixture of 18 proteins and known contaminants


Figure 4. Comparison of number of identified and missed “high-confidence peptides” by MassWiz, Mascot, Sequest, OMSSA and X!Tandem for standard mixture on different instruments (first six data series) and mid-log phase yeast data (last series) at 1% FDR. Peptides identified by any three out of five algorithms are considered as high confidence peptides. 100% corresponds to a pool of high-confidence peptides from the five algorithms.

were searched using all five algorithms with parameters matched as closely as possible. Where a parameter could not be controlled, the default was used. Broadly, all data sets were searched at 3 Da precursor tolerance, 1 Da fragment mass tolerance, tryptic digestion with 1 missed cleavage, and a static modification of carbamidomethylation at cysteine residues. The significance tests used by the algorithms differ in their statistical models and assumptions, so they are not directly comparable. Multiple hypothesis testing correction is accomplished by controlling the FDR at a fixed value, and the FDR can be easily estimated using a target-decoy based strategy. The algorithms were therefore compared after 1% FDR correction was applied to their results. The numbers of assigned spectra and unique peptides are shown in Table 2, which depicts the performance of the algorithms on data sets from different instruments. In terms of spectral assignments, MassWiz performs better than Sequest, X!Tandem, and OMSSA for all instrument types except ABI-4700, as shown in Figure 1. Between Mascot and MassWiz, Mascot performs slightly better in the other five data sets, whereas MassWiz was better in the AGILENT XCT data set. When we compare the number of uniquely identified peptides by the algorithms, similar trends are observed (Figure 2). Although MassWiz identified 0.65% (297) fewer spectra than Mascot, it identified 7.2% (3284) more than OMSSA, 17.1% (7820) more than Sequest, and 43.5% (19875) more than X!Tandem in the standard mixture (Table 2A; percentages are relative to the MassWiz totals). Similarly, it assigned 2.8% (63) fewer peptides than Mascot while assigning 4.6% (103) more than OMSSA, 15.9% (358) more than Sequest, and 21.4% (481) more than X!Tandem. We observed that, apart from identifying new peptides, MassWiz also identified a high number of peptides that were observed by other methods. Mascot shows the highest number of uniquely identified peptides, which explains its high number of assignments.

Similar analyses were carried out on the yeast data set (Table 2B), where MassWiz assignments were close to those of Mascot, but OMSSA assigned a significantly larger number of spectra and peptides than all the other algorithms (Figure 3A and B).

While the number of spectra and peptides assigned by an algorithm has traditionally been used as a metric for comparing algorithms, the quality of assignments is generally not checked. The main reason is the subjective nature of manual validation, which also depends on the expertise of the person. We used an objective method in which the agreement between algorithms serves as a measure of peptide quality. It has been shown that multiple algorithm consensus enhances the accuracy of peptide identification.48 To compare the algorithms for their quality of matches, a set of high-confidence peptides is required. So, we mapped the overlaps between the algorithms for all identified peptides. For each data set, we segregated peptides identified by at least three algorithms and termed these “high-confidence peptides”. The number of identified and missed high-confidence peptides for each algorithm is shown in Figure 4. The figure shows that MassWiz identifies the highest proportion of such peptides in four data sets, and in two data sets OMSSA performs slightly better. In the yeast data set, most of the unique OMSSA assignments were either single spectra or nonconsensus assignments. MassWiz lags a little behind Mascot, OMSSA, and Sequest in the ABI-4700 data set. The data are also tabulated in Supplementary Tables 1A and 1B. Similar trends were observed for high-confidence peptides defined as those identified by at least 2 algorithms and by at least 4 algorithms, which strengthens the confidence in these observations (Supplementary Figures 1 and 2). Overall, MassWiz identifies the largest number of high-confidence peptides considering all standard mixture data sets together and misses the fewest such peptides. This makes MassWiz a versatile and useful algorithm for various instrument platforms and well suited to high mass accuracy data, which are gaining popularity owing to fast technological improvements. It has been previously shown that the consensus of three search algorithms can yield higher sensitivity and specificity than a single search engine.17 MassWiz agrees highly with the consensus of three algorithms, which makes it highly useful when used singly or in combination with other algorithms.
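The consensus criterion used above (peptides identified by at least three of the five algorithms) is straightforward to compute from the FDR-filtered peptide lists. The Python sketch below is only an illustration of that bookkeeping; the authors used a Perl program, and the function names and the input layout (a dictionary mapping algorithm name to a set of peptide sequences) are assumptions.

```python
from collections import Counter


def high_confidence_peptides(results, min_algorithms=3):
    """Return peptides reported by at least `min_algorithms` algorithms.

    results maps an algorithm name to an iterable of peptide sequences
    that passed the 1% FDR threshold for that algorithm.
    """
    counts = Counter()
    for peptides in results.values():
        counts.update(set(peptides))  # count each peptide once per algorithm
    return {pep for pep, n in counts.items() if n >= min_algorithms}


def identified_and_missed(results, min_algorithms=3):
    """Return {algorithm: (identified, missed)} counts against the pooled
    high-confidence peptide set, the quantities plotted in Figure 4."""
    pool = high_confidence_peptides(results, min_algorithms)
    return {name: (len(pool & set(peptides)), len(pool - set(peptides)))
            for name, peptides in results.items()}
```

Changing `min_algorithms` to 2 or 4 reproduces the alternative consensus definitions mentioned for Supplementary Figures 1 and 2.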

CONCLUSIONS

Our results show that MassWiz is an efficient, accurate, and versatile algorithm. Because it is open-source and configurable, modifications to the scoring function or development of supplementary plug-ins can be easily achieved through community participation. The results show that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. As ETD is being explored in greater detail, we intend to extend the scoring algorithm to incorporate ETD data analysis in future work.

ASSOCIATED CONTENT

Supporting Information

Supplementary Table 1 shows the comparison of high-confidence peptides for the five algorithms in (A) six standard mix data sets and (B) yeast mid-log phase data set. Supplementary Figure 1 shows comparison of peptides identified by two or more algorithms. Supplementary Figure 2 shows comparison of peptides identified by four or more algorithms. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*Fax: +91 011 27667471. E-mail: [email protected].

ACKNOWLEDGMENT

The authors thank Dr. Rajesh Gokhale, Dr. Anurag Agrawal, Dr. Shantanu Sengupta, and Dr. Akhilesh Pandey for their valuable suggestions. We also thank Dr. G. P. Singh for his insightful comments while proof-reading the manuscript. We thank Dhanashree S. Kelkar for providing input to the manuscript. The work was supported by CSIR SRF grant and CSIR network project on Plasma Proteomics – Health, Environment and Disease (NWP-04).

REFERENCES

(1) Karas, M.; Hillenkamp, F. Laser desorption ionization of proteins with molecular masses exceeding 10,000 Da. Anal. Chem. 1988, 60 (20), 2299–2301.
(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Electrospray ionization for mass spectrometry of large biomolecules. Science 1989, 246 (4926), 64–71.
(3) Steen, H.; Mann, M. The abc’s (and xyz’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol. 2004, 5 (9), 699–711.
(4) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198–207.
(5) Washburn, M. P.; Wolters, D.; Yates, J. R., III Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19 (3), 242–247.
(6) Matthiesen, R. Extracting monoisotopic single-charge peaks from liquid chromatography-electrospray ionization-mass spectrometry. Methods Mol. Biol. 2007, 367, 37–48.
(7) Nguyen, N.; Huang, H.; Oraintara, S.; Vo, A. Peak detection in mass spectrometry by Gabor filters and envelope analysis. J. Bioinform. Comput. Biol. 2009, 7 (3), 547–569.
(8) Zhang, S.; DeGraba, T. J.; Wang, H.; Hoehn, G. T.; Gonzales, D. A.; Suffredini, A. F.; Ching, W. K.; Ng, M. K.; Zhou, X.; Wong, S. T. A novel peak detection approach with chemical noise removal using short-time FFT for prOTOF MS data. Proteomics 2009, 9 (15), 3833–3842.
(9) Tabb, D. L.; Shah, M. B.; Strader, M. B.; Connelly, H. M.; Hettich, R. L.; Hurst, G. B. Determination of peptide and protein ion charge states by Fourier transformation of isotope-resolved mass spectra. J. Am. Soc. Mass Spectrom. 2006, 17 (7), 903–915.
(10) Sadygov, R. G.; Hao, Z.; Huhmer, A. F. Charger: combination of signal processing and statistical learning algorithms for precursor charge-state determination from electron-transfer dissociation spectra. Anal. Chem. 2008, 80 (2), 376–386.
(11) Flikka, K.; Martens, L.; Vandekerckhove, J.; Gevaert, K.; Eidhammer, I. Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 2006, 6 (7), 2086–2094.
(12) Salmi, J.; Nyman, T. A.; Nevalainen, O. S.; Aittokallio, T. Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 2009, 9 (4), 848–860.
(13) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74 (20), 5383–5392.
(14) Eriksson, J.; Fenyo, D. The statistical significance of protein identification results as a function of the number of protein sequences searched. J. Proteome Res. 2004, 3 (5), 979–982.
(15) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4 (10), 787–797.
(16) Nesvizhskii, A. I.; Aebersold, R. Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discovery Today 2004, 9 (4), 173–181.
(17) Sultana, T.; Jordan, R.; Lyons-Weiler, J. Optimization of the use of consensus methods for the detection and putative identification of peptides via mass spectrometry using protein standard mixtures. J. Proteomics Bioinform. 2009, 2 (6), 262–273.
(18) MacLean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22 (22), 2830–2832.
(19) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, 2005.
(20) Tanner, S.; Shu, H.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77 (14), 4626–4639.
(21) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.
(22) Eng, J. K.; McCormack, A. L.; Yates, J. R., III An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976–989.
(23) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.
(24) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958–964.
(25) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5 (13), 3475–3490.
(26) Dagda, R. K.; Sultana, T.; Lyons-Weiler, J. Evaluation of the consensus of four peptide identification algorithms for tandem mass spectrometry based proteomics. J. Proteomics Bioinform. 2010, 3, 39–47.
(27) Havilio, M.; Haddad, Y.; Smilansky, Z. Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 2003, 75 (3), 435–444.
(28) Narasimhan, C.; Tabb, D. L.; VerBerkmoes, N. C.; Thompson, M. R.; Hettich, R. L.; Uberbacher, E. C. MASPIC: intensity-based tandem mass spectrometry scoring scheme that improves peptide identification at high confidence. Anal. Chem. 2005, 77 (23), 7581–7593.
(29) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 2010, 604, 55–71.
(30) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214.
(31) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7 (1), 29–34.
(32) Moore, R. E.; Young, M. K.; Lee, T. D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002, 13 (4), 378–386.
(33) Wang, G.; Wu, W. W.; Zhang, Z.; Masilamani, S.; Shen, R. F. Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009, 81 (1), 146–159.
(34) Blanco, L.; Mead, J. A.; Bessant, C. Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG2006 standard MS/MS data sets. J. Proteome Res. 2009, 8 (4), 1782–1791.
(35) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools. J. Proteome Res. 2008, 7 (1), 96–103.
(36) Colinge, J.; Masselot, A.; Carbonell, P.; Appel, R. D. InSilicoSpectro: An open-source proteomics library. J. Proteome Res. 2006, 5 (3), 619–624.
(37) Hoopmann, M. R.; Finney, G. L.; MacCoss, M. J. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal. Chem. 2007, 79 (15), 5620–5632.
(38) Kast, J.; Gentzel, M.; Wilm, M.; Richardson, K. Noise filtering techniques for electrospray quadrupole time of flight mass spectra. J. Am. Soc. Mass Spectrom. 2003, 14 (7), 766–776.
(39) Wysocki, V. H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. Mobile and localized protons: a framework for understanding peptide dissociation. J. Mass Spectrom. 2000, 35 (12), 1399–1406.
(40) Tabb, D. L.; Huang, Y.; Wysocki, V. H.; Yates, J. R., III Influence of basic residue content on fragment ion peak intensities in low-energy collision-induced dissociation spectra of peptides. Anal. Chem. 2004, 76 (5), 1243–1248.
(41) Breci, L. A.; Tabb, D. L.; Yates, J. R., III; Wysocki, V. H. Cleavage N-terminal to proline: analysis of a database of peptide tandem mass spectra. Anal. Chem. 2003, 75 (9), 1963–1971.
(42) Khatun, J.; Ramkissoon, K.; Giddings, M. C. Fragmentation characteristics of collision-induced dissociation in MALDI TOF/TOF mass spectrometry. Anal. Chem. 2007, 79 (8), 3032–3040.
(43) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A.; Speed, T. P.; Simpson, R. J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003, 75 (22), 6251–6264.
(44) Frank, A. M. Predicting intensity ranks of peptide fragment ions. J. Proteome Res. 2009, 8 (5), 2226–2240.
(45) Bythell, B. J.; Suhai, S.; Somogyi, A.; Paizs, B. Proton-driven amide bond-cleavage pathways of gas-phase peptide ions lacking mobile protons. J. Am. Chem. Soc. 2009, 131 (39), 14057–14065.
(46) Paizs, B.; Suhai, S. Fragmentation pathways of protonated peptides. Mass Spectrom. Rev. 2005, 24 (4), 508–548.
(47) Cramer, R.; Corless, S. The nature of collision-induced dissociation processes of doubly protonated peptides: comparative study for the future use of matrix-assisted laser desorption/ionization on a hybrid quadrupole time-of-flight mass spectrometer in proteomics. Rapid Commun. Mass Spectrom. 2001, 15 (22), 2058–2066.
(48) Yu, W.; Taylor, J. A.; Davis, M. T.; Bonilla, L. E.; Lee, K. A.; Auger, P. L.; Farnsworth, C. C.; Welcher, A. A.; Patterson, S. D. Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics 2010, 10 (6), 1172–1189.