
Jun 7, 2016 - These developments facilitate the use of targeted data analysis approaches ... example, large-scale studies on IgG Fc N-glycosylation were.
Article pubs.acs.org/jpr

LaCyTools: A Targeted Liquid Chromatography−Mass Spectrometry Data Processing Package for Relative Quantitation of Glycopeptides

Bas C. Jansen,† David Falck,† Noortje de Haan,† Agnes L. Hipgrave Ederveen,† Genadij Razdorov,‡ Gordan Lauc,‡ and Manfred Wuhrer*,†

† Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300RC Leiden, The Netherlands
‡ Department of Biochemistry and Molecular Biology, Faculty of Pharmacy and Biochemistry, University of Zagreb, A. Kovačića 1, HR10000 Zagreb, Croatia



Supporting Information

ABSTRACT: Bottom-up glycoproteomics by liquid chromatography−mass spectrometry (LC−MS) is an established approach for assessing glycosylation in a protein- and site-specific manner. Consequently, tools are needed to automatically align, calibrate, and integrate LC−MS glycoproteomics data. We developed a modular software package designed to tackle the individual aspects of an LC−MS experiment, called LaCyTools. Targeted alignment is performed using user-defined m/z and retention time (tr) combinations. Subsequently, sum spectra are created for each user-defined analyte group. Quantitation is performed on the sum spectra, where each user-defined analyte can have its own tr, minimum, and maximum charge states. Consequently, LaCyTools deals with multiple charge states, which gives an output per charge state if desired, and offers various analyte and spectra quality criteria. We compared throughput and performance of LaCyTools to combinations of available tools that deal with individual processing steps. LaCyTools yielded relative quantitation of equal precision (relative standard deviation [...]

[...] 20 ppm) in the [M+5H]5+ charge state due to a bad peak shape. Since the other QC passed the threshold and the influence of the interference on quantitation was judged to be minor, the bias incurred by exclusion was judged to be

Performance and Data Storage

The performance of LaCyTools was evaluated by using a set of 221 measurements of one IgG glycopeptide sample, which had been acquired over the period of one month. A complete workflow was performed on these measurements including alignment, calibration, integration, and calculation of the QC. The time was tracked using the log file created by LaCyTools. For comparison purposes, we tracked the processing time of the four AAT measurements using the individual packages of MZmine2, DA, and 3D Max. Notably, these require significantly more hands-on time than LaCyTools. Furthermore, because of the input and output (IO) performance limitations and size of the mzXML MS data format, an optional data encoding scheme, based on the BLOSC library, was included.29 BLOSC (blocking, shuffling, and compression) is a meta-compression library optimized for IO performance over compression efficiency. Data are written on disk as blocked, shuffled, and compressed chunks. Chunks are transmitted to a CPU cache and decompressed faster than by using traditional methods. The BLOSC library was first implemented as a filter in the PyTables library, a Python library that is built around the HDF5 library.30,31 The HDF5 library itself is a library for hierarchical multidimensional binary data, able to cope with extremely large and complex data sets. LaCyTools is able to encode all measurements in two extendable arrays, where one array stores the m/z values and the other array stores the intensities. Each array contains a single row for each scan that was measured during the LC−MS run. We compared the performance of LaCyTools using the standard mzXML reading methods and the HDF5 methods as well as the file size using the four AAT measurements.
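The shuffle-then-compress idea described above can be illustrated with a minimal sketch. It is not the BLOSC or PyTables implementation: plain zlib from the Python standard library stands in for the BLOSC codec, the blocking step is omitted, and all function names are hypothetical. It only shows how one scan could be stored as a shuffled, compressed row, with m/z values and intensities kept in two parallel row collections.

```python
import struct
import zlib


def shuffle_bytes(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1, and so on (byte shuffle)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def unshuffle_bytes(shuffled: bytes, itemsize: int) -> bytes:
    """Invert shuffle_bytes, restoring the original item-by-item byte order."""
    n_items = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n_items:(i + 1) * n_items]
    return bytes(out)


def encode_chunk(values, itemsize=8):
    """Pack floats as little-endian doubles, shuffle, then compress.

    zlib is an assumption standing in for the BLOSC codec.
    """
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(shuffle_bytes(raw, itemsize))


def decode_chunk(blob, itemsize=8):
    """Decompress, unshuffle, and unpack back into a list of floats."""
    raw = unshuffle_bytes(zlib.decompress(blob), itemsize)
    return list(struct.unpack(f"<{len(raw) // itemsize}d", raw))


def encode_run(scans):
    """Encode a run as two parallel row lists, one row per scan.

    scans is a list of (mz_values, intensities) tuples.
    """
    mz_rows = [encode_chunk(mz) for mz, _ in scans]
    intensity_rows = [encode_chunk(inten) for _, inten in scans]
    return mz_rows, intensity_rows
```

Shuffling places the slowly varying high-order bytes of neighbouring m/z values next to each other, which is the property that makes numeric arrays easier for a general-purpose compressor to handle.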



RESULTS AND DISCUSSION

LaCyTools was developed to allow the (pre)processing of glycoproteomic LC−MS data in a robust and high-throughput manner. LaCyTools is able to perform tr alignment, mass spectra calibration, targeted data integration of all isotopes of a list of user-defined analytes, and QC calculation (for each analyte, the mass error, S/N, and IPQ, all per charge state). We compared LaCyTools with several other software packages at individual

DOI: 10.1021/acs.jproteome.6b00171 J. Proteome Res. 2016, 15, 2198−2210


Journal of Proteome Research

Figure 2. EICs of main AAT glycopeptides. The EICs of the major glycopeptides are shown for each glycosylation site of AAT. The displayed chromatograms are extracted with a width of 0.05 Th using the three main isotopes of the quadruply and quintuply charged glycopeptides. These results show that the dominant glycan on N70 and N271 is the fully sialylated diantennary glycan, while on N107, the fully sialylated triantennary glycan is also present both in its fucosylated and afucosylated form. These results are in agreement with the literature.6

Figure 3. Alignment functions for fitting of AAT, measured with different LC gradients. A total of four analyses of AAT using differing LC gradients were aligned using LaCyTools. (A) gradient 1, (B) gradient 2, (C) gradient 3, and (D) gradient 4. The features used for alignment are shown before and after alignment (blue and red markers, respectively). Furthermore, the figure shows the function that was used for the alignment (blue) as well as the target line (red, striated). The manually estimated tr values from gradient 1 were used as target values.

To demonstrate the potential of LaCyTools, we performed a complete analysis of two glycoproteomic measurement sets: (1) the analysis of a single AAT sample measured using four different gradients and (2) the analysis of a single IgG sample measured 221 times over the period of one month. The analysis included tr

steps of an entire glycoproteomic workflow, and an overview of the capabilities of the tested software packages is presented in Table 1. The LaCyTools package is released under the Apache 2.0 license and is freely available on GitHub (https://github.com/Tarskin/LaCyTools).


Figure 4. Alignment of AAT chromatograms. A total of four chromatograms, measured using differing LC gradients, were aligned using LaCyTools. (A) EIC of feature N70-H5N4S2 before alignment. (B) EIC of feature N271-H5N4S2 before alignment. (C) EIC of feature N70-H5N4S2 after alignment. (D) EIC of feature N271-H5N4S2 after alignment. LaCyTools was able to align both features to within 1 s of the desired tr in all runs.

difference between the highest and lowest data point in a region as noise, is caused by slightly varying tr tolerances between the two methods. The alignment function of LaCyTools was tested using a single AAT sample that was measured using four different LC gradients. Alignment was performed using five features, namely N107-H5N4F1S1 (m/z 1438.641), N107-H5N4S2 (m/z 1474.901), N70-H5N4S2 (m/z 1347.353), N70-H5N4F1S2 (m/z 1383.868), and N271-H5N4S2 (m/z 990.918), with all m/z values calculated as [M+4H]4+. The accurate tr of the features was determined as the peak maximum observed in the automatically generated EIC, and coupled to their target tr, as manually estimated from gradient 1 (Figure 2). A power law function was used to align the features to the target tr (Figure 3). Before alignment, the unaligned time difference (Δutr) between exact and accurate tr varied between −37.7 s for N271-H5N4S2 in gradient 3 and 20.5 s for N271-H5N4S2 in gradient 2 (Figure 4). Following alignment, all features had a residual time difference (Δtr), between exact and accurate tr, of less than 1 s (Figure 4; Supporting Information, Table S-3). Furthermore, the total root-mean-square of all Δtr was below 0.5 s. Additionally, we tested the alignment using 221 IgG glycopeptide runs of the same sample, measured over a 1 month period. A total of nine features were used for alignment of each LC−MS run (Supporting Information, Figure S-4). Before alignment, there was a maximum tr difference of 13.9 s between measurements (−6.8 s vs 7.1 s), with 657 of 1989 (9 × 221) features showing a Δutr of above 1 s

alignment, sum spectra creation, and calibration, after which targeted data and background integration took place, and QC were calculated (Figure 1). The analyte list was created by manual inspection of sum spectra created per peptide moiety using all spectra that were recorded around the spectrum with the highest intensity for the most abundant glycoform based on published literature for both the AAT and IgG glycopeptide samples.2,6 Identity of the main glycopeptides of AAT was confirmed by MS2 (Supporting Information, Figure S-3). No glycoforms additional to the ones published before were detected on the basis of MS1 and MS2 data.

Alignment

The first part of the alignment requires the determination of the S/N of the features for alignment using the NOBAN algorithm. We have compared the NOBAN algorithm with both a manual approach and the commercially available DA. The resulting S/N values are generally 2.6 times higher than manually calculated values, the average of two features over four AAT measurement runs being 237 (LaCyTools) and 92 (manual). A full table comparing different S/N determinations can be found in the Supporting Information, Table S-2. The large difference between DA and LaCyTools is caused by the DA method overestimating the noise, as illustrated by the peak eluting before the N107-H5N4S2 glycopeptide being considered noise using the DA method (Supporting Information, Figure S-2C). Furthermore, the smaller difference between the manual S/N determination and the S/N reported by LaCyTools, using the


Figure 5. Sum spectra of AAT glycopeptide clusters. (A) Sum spectrum of the glycopeptide cluster of QLAHQSN70STNIFFSPVSIATAFAMLSLGTK. (B) Sum spectrum of the glycopeptide cluster of ADTHDEILEGLNFN107LTEIPEAQIHEGFQELLR. (C) Sum spectrum of the glycopeptide cluster of YLGN271ATAIFFLPDEGK. All sum spectra were generated with a resolution of 100 data points per Th. Displayed glycopeptide structures are based on their accurate mass, previous literature, and MS2 where possible.6 All nonannotated major peaks could not be assigned to an AAT glycopeptide composition.

and the maximum observed Δutr being 8.1 s (Supporting Information, Table S-4). After alignment, 95.4% of these features showed a Δtr within 1 s of the target tr. The remaining 92 of 1989 (9 × 221) features, which showed a Δtr of more than 1 s, all showed a Δtr of less than 2 s, except for five features that showed a poor peak shape (Supporting Information, Table S-5). LaCyTools uses a power law function for alignment, as opposed to the polynomial that is used during calibration. The advantage of a power law over a polynomial fit is that the minimum or maximum will always be at x = 0. With the function minimum or maximum at x ≠ 0 in a polynomial fit, two scans could have the same tr after alignment (data not shown). However, a power law function performs similarly to a polynomial function for runs where the above-mentioned problem does not occur (Supporting Information, Figure S-5). LaCyTools alignment was compared to an existing alignment program, MSAlign2, which uses a genetic algorithm, a heuristic search that mimics natural selection.17,32 MSAlign2 requires a master file that it will use as an alignment target. Therefore, to ensure fair comparison, the target times for LaCyTools alignment were taken from the same run as the specified master file for MSAlign2. However,

MSAlign2 was unable to align the data, presumably due to the low number of features that were detected in our sample. This result clearly demonstrates the difference between untargeted proteomic and targeted glycoproteomic experiments, namely the data density. Untargeted alignment works well for proteomics studies, but the requirement for a large number of features hinders its applicability to targeted glycoproteomic study designs. Furthermore, LaCyTools alignment was compared to targeted peak detection and alignment using MZmine2.16 MZmine2 was able to detect 2000 peaks in the AAT measurements, which included the peaks of interest. The alignment of several analytes was evaluated, including the main glycopeptide H5N4S2 in each peptide cluster. After alignment, the Δtr of each analyte that was used for alignment was below 1 s (data not shown). LaCyTools also showed a Δtr of less than 1 s for all features that were used for alignment (Supporting Information, Table S-3). However, the main difference between the two methods is the ease of use, as illustrated by MZmine2 requiring the user to perform several sequential steps, each with its own set of parameters that need optimization. For instance, we spent several hours finding the exact parameters required to achieve an MZmine2 alignment


Figure 6. AAT quantitation. Relative quantitation was compared for LaCyTools, 3D Max Extractor, and Bruker DataAnalysis using the three peptide clusters of AAT. (A) N70 peptide cluster. (B) N107 peptide cluster. (C) N271 peptide cluster. The glycopeptides that could not be detected with all methods are marked with a star. The 3D Max Extractor, Bruker DataAnalysis, and LaCyTools max quantitation option methods show similar results. Compared to LaCyTools using MS peak area integration, these methods yield higher values for the smallest and most abundant glycopeptide in each cluster. This is caused by the difference in quantitation methods, specifically MS peak intensity versus peak area.

where the Δtr for all features was below 1 s. Within LaCyTools, we simply set the tr tolerance to be the maximum observed Δtr ± 10%.
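The power law alignment discussed in this section can be sketched as follows. This is only an illustration under stated assumptions: it uses a two-parameter form, target tr = a · (observed tr)^b, fitted by linear least squares in log-log space. The exact functional form and fitting routine used by LaCyTools are not given in this section, and all function names below are hypothetical.

```python
import math


def fit_power_law(observed, target):
    """Fit target = a * observed**b by linear regression on log-transformed times.

    Assumes all retention times are positive.
    """
    xs = [math.log(t) for t in observed]
    ys = [math.log(t) for t in target]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def apply_power_law(t, a, b):
    """Map an observed retention time onto the target time axis."""
    return a * t ** b


def root_mean_square(residuals):
    """Root-mean-square of the residual time differences."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical feature retention times in seconds (not taken from the paper).
target_tr = [420.0, 455.0, 610.0, 640.0, 905.0]
observed_tr = [431.0, 467.5, 628.0, 659.0, 934.0]

a, b = fit_power_law(observed_tr, target_tr)
residuals = [apply_power_law(t, a, b) - ref for t, ref in zip(observed_tr, target_tr)]
print(f"a = {a:.4f}, b = {b:.4f}, RMS residual = {root_mean_square(residuals):.3f} s")
```

For positive a and b, this function increases strictly with t for all t > 0, so two scans with different observed tr cannot be mapped onto the same aligned tr. This is the property that motivates preferring a power law over a polynomial.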

Sum Spectrum and Calibration

LaCyTools performs calibration and quantitation on a sum spectrum that is created by summing all spectra within a specified tr region. The main advantage of using a sum spectrum to calibrate a peptide cluster is that features that elute either in the beginning or toward the end of the peptide cluster are present in the sum spectrum and can be used as calibrants. For calibration, we use a spline fit combined with a polynomial function.22 To evaluate the quality of the calibration, we calibrated both AAT and IgG glycopeptide samples. AAT measurements were calibrated using a sum spectrum resolution of 100 data points per Th (Figure 5). Per peptide cluster of AAT, H5N4S2, H5N4F1S2, H6N5S3, and H6N5F1S3 glycoforms were used as potential calibrants. The average residual mass error after calibration was below 1 ppm for all quadruply charged potential calibrants, and the SD varied between 1.0 and 5.9 ppm. For the quintuply charged potential calibrants, the program showed an average residual mass error below 2 ppm for four of the eight potential calibrants, while the remaining four potential calibrants showed average residual mass errors above 20 ppm (Supporting Information, Table S-6). The four calibrants that showed a higher residual mass error also showed an interference or a poor peak shape. The SD of the correctly identified calibrants was between 1.2 and 2.0 ppm. Furthermore, for the 221 IgG glycopeptide runs of the same sample, measured over a 1 month period (Supporting Information, Figure S-6), an average residual mass error below 1 ppm was found for all potential calibrants (H3N4F1, H4N4F1, H5N4F1, and H5N4F1S1) in all IgG subclasses, in both the doubly and the triply charged forms. The SD of the residual mass errors varied between 1.5 ppm for IgG2-H5N4F1S1 and 4.2 ppm for IgG1-H5N4F1 (Supporting Information, Table S-7). Furthermore, the average residual mass error of noncalibrant glycopeptides was between −3 and 6 ppm in either doubly or triply charged forms (Supporting Information, Table S-8). The higher mass error for these peaks is caused by irregular peak shapes as both glycopeptides have less than 0.5% relative abundance. These results demonstrate that calibration of peptide clusters using LaCyTools yields high-quality calibration. Furthermore, the postcalibration residual mass errors can be used to identify potential contaminant peaks. High-quality calibration is essential to allow the researcher to use a narrow m/z region during quantitation, leading to improved selectivity.

Relative Quantitation and Quality Control
Quantitation using LaCyTools was compared with quantitation using either DA or 3D Max. The 3D Max quantitation was performed using a wider quantitation region than used for LaCyTools and DA as 3D Max Extractor does not have an option for calibration. The profiles reported by DA, 3D Max, and the max quantitation option of LaCyTools are similar, as all use MS peak intensity. The values reported by MS peak area integration of LaCyTools differ slightly from both the DA and 3D Max values as is illustrated by the H5N4S2 glycopeptide being only 80.7% (SD 0.4%) for N70, while it is 86.8% (SD 0.1%) and 86.7% (SD 0.1%) with DA and 3D Max, respectively (Figure 6). The discrepancy in relative quantities is partially caused by an increase of the peak width with increasing m/z, something that is ignored when looking only at the maximum intensity. Another reason for the discrepancy is that less of the isotope pattern is considered in the DA and 3D Max methods. Both biases result in an underestimation of larger glycopeptides and an overestimation of smaller glycopeptides in DA, 3D Max, and the max quantitation option of LaCyTools. Furthermore, there was no difference between the variation observed for all four methods, as illustrated by the SD of the main AAT glycoform never exceeding 1% for all three peptide clusters (Supporting Information, Table S-9). We previously reported that MS peak area integration yielded a higher precision than MS peak intensity quantitation.22 We did not observe this for the AAT measurements, possibly due to the simplicity of AAT glycosylation with one dominant glycoform (>80% relative abundance) for N271 and N70. However, for N107, we did notice a minor improvement of MS peak area integration when compared to MS peak intensity quantitation. Therefore, we also compared the MS peak area integration with MS peak intensity quantitation using the 221 IgG measurements, which have a more complex glycosylation than AAT. 
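The effect of peak width on the two quantitation modes can be illustrated with a small synthetic sketch. It is not the LaCyTools integration routine: the Gaussian peak model, the peak widths, the integration window, and all function names are assumptions chosen for illustration. Two peaks of equal height are simulated, with the peak at higher m/z being wider.

```python
import math


def gaussian_peak(mz_axis, center, height, sigma):
    """Intensity profile of a single Gaussian peak sampled on mz_axis."""
    return [height * math.exp(-0.5 * ((mz - center) / sigma) ** 2) for mz in mz_axis]


def peak_maximum(mz_axis, intensities, center, window):
    """Highest intensity within center +/- window (peak intensity readout)."""
    return max(i for mz, i in zip(mz_axis, intensities) if abs(mz - center) <= window)


def peak_area(mz_axis, intensities, center, window):
    """Trapezoidal area within center +/- window (peak area readout)."""
    pts = [(mz, i) for mz, i in zip(mz_axis, intensities) if abs(mz - center) <= window]
    return sum(0.5 * (i0 + i1) * (m1 - m0) for (m0, i0), (m1, i1) in zip(pts, pts[1:]))


def relative_abundance(values):
    """Normalize a list of readouts to percentages of their total."""
    total = sum(values)
    return [100.0 * v / total for v in values]


# Synthetic spectrum at 100 data points per Th with two equally high peaks.
mz_axis = [999.0 + 0.01 * k for k in range(40201)]
peaks = [(1000.0, 0.030), (1400.0, 0.045)]  # (center, sigma); widths are assumed
spectrum = [0.0] * len(mz_axis)
for center, sigma in peaks:
    for k, value in enumerate(gaussian_peak(mz_axis, center, 1.0, sigma)):
        spectrum[k] += value

by_max = relative_abundance([peak_maximum(mz_axis, spectrum, c, 0.25) for c, _ in peaks])
by_area = relative_abundance([peak_area(mz_axis, spectrum, c, 0.25) for c, _ in peaks])
print("relative abundance by maximum:", [round(v, 1) for v in by_max])
print("relative abundance by area:   ", [round(v, 1) for v in by_area])
```

Because both synthetic peaks have the same height, the maximum-based readout assigns them equal shares, whereas the area-based readout assigns the wider peak at higher m/z a larger share. This mirrors the underestimation of larger glycopeptides by intensity-based quantitation described above.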
The major glycoform of IgG1, H4N4F1, showed a significant difference between the precision of LaCyTools and 3D Max, with the SDs being 1.2% and 3.9% with LaCyTools and 3D Max, respectively (Brown−Forsythe test; p-value