Annotating Proteomic Mass Spectrometry Data for Qualitative and Quantitative Experiments

by

Sean J. McIlwain

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Computer Sciences)

at the

University of Wisconsin – Madison

2008

© Copyright by Sean J. McIlwain 2008

All Rights Reserved


Abstract

This dissertation presents and evaluates algorithms that annotate mass spectrometry data from qualitative and quantitative experiments. The problems encountered when using mass spectrometry to predict the condition of samples (e.g., disease), and to determine the relative differences in protein expression between two samples, motivate the development of these algorithms. This dissertation introduces algorithms that couple statistical learning and dynamic programming to annotate peaks from a mass spectrum into groups that correspond to the underlying chemistry of the studied samples. We demonstrate the utility of annotating these groups over the expert's annotations and other similar algorithms currently in the field. In the process, the potential of using a "mostly" correct example set over a "near-miss" example set is shown to improve the annotation results. Using these annotated groups as features for classification, rather than just the individual peaks, improves the prediction accuracy for prion disease and provides a better reflection of the underlying chemistry. Next, this dissertation derives and evaluates algorithms that annotate mass spectrometry data from quantitative experiments containing a metabolic label. The algorithm takes as input the annotated groups from the previous algorithm and subsequently matches the "light" and "heavy" pairs of molecules that have the same atomic formula but differ in their atomic isotope content. Finally, this dissertation compares and contrasts algorithms that calculate the relative quantitative ratios from the annotated isotopic matched pairs. It then presents a metric for measuring the annotation and quantitative calculations simultaneously. Overall, this dissertation demonstrates the potential utility of statistical learning applied to the problems of qualitative and quantitative mass spectrometry.


Acknowledgements

There are many influences that have assisted me throughout my graduate experience here at the University of Wisconsin in Madison. I would mostly like to acknowledge my advisor, David Page, who has provided me with direction in the multidisciplinary fields of biostatistics and computational biology. He has been instrumental in helping me with my ideas and in fostering my independence as a researcher. His patience and encouragement as an advisor were indispensable. I also would like to thank the other members of my thesis committee: Dr. Jude Shavlik, Dr. Mark Craven, Dr. Lingjun Li, and Dr. Mike Sussman. I had the privilege of working closely with Drs. Mike Sussman and Lingjun Li and appreciate their valuable insights into the properties and nuances that arise from the collection and processing of mass spectrometry data. Jude Shavlik and Mark Craven were both involved with my training at the University of Wisconsin and provided great instruction and advice for my talks, papers, and research. Their diligence and critiques are a great example to follow.

I have had faculty and graduate students from other fields who have collaborated with me. Dr. Judd Aiken and Allen Herbst from the Department of Comparative Biosciences provided the samples for the Prion study. Their insight and advice into the biology of prion disease have been invaluable to me in the analysis and processing of their data. Joshua Schmidt, from Dr. Lingjun Li's group in the Department of Pharmacy, collected the mass spectra for the Prion study. His ideas and assistance have been instrumental in my training as a researcher. I would also like to acknowledge Edward Huttlin, from Dr. Mike Sussman's group in the Department of Biochemistry, who provided me with his data for the metabolic labeling experiments. His knowledge of and insight into quantitative experiments for mass spectrometry enabled me to improve upon the algorithms developed for annotating the mass spectrometry data.

There have been many graduate and post-doc colleagues who have influenced my graduate life while in Madison. I especially would like to thank Irene Ong and Ian Alderman, who convinced me to come to Madison for graduate school. Their continued friendship has been invaluable to me throughout my graduate career. Adam Smith, Yue Pan, Frank Dimaio, and Bess Berg have been office mates with whom I had many valuable interactions and discussions. Louis Oliphant, Mark Goadrich, Keith Noto, Lisa Torrey, Trevor Walker, Burr Settles, Eric Lantz, Ameet Soni, Mike Waddell, Larry Henrix, and Soumya Ray of the Machine Learning Group were a resource for advice on my talks and papers. Dr. Rich Maclin at the University of Minnesota in Duluth has influenced my graduate career with advice and good humor.

During the course of my graduate training I have had many other experiences with friends from my extracurricular activities. I would like to acknowledge friends from the Shorin-Ryu Karate Club, especially Kaicho Kurt Christensen, Sensei Louis-John Janowski, Sensei Shinya Nishizawa, and Sensei Rost Boutchko, who have been my instructors and friends. Sensei Frank Thorne, Sensei Daniel Luption, Sensei Karlin Younger, Sensei Emily Lupton, and Chad Schmiedt are also friends of mine whose interactions and practice have been helpful. Sensei Stephanie Britton and Sensei Josh Gildea have been great friends who have supported and listened to me during my graduate life. I thank Sensei Michaela Klyve-Wood, my girlfriend, for teaching me to not take life so seriously and for proofreading my thesis. From the Ryu-Kyu Kempo Association, I would like to thank Amor Kaicho, who has given me advice for both my martial arts and life training. I would also like to thank members of the Aikido of Madison group on Atwood Ave., notably John Stone, Robin Cooper, Kerry Connell, Jeff Hempel, and Barb Brown, whose instruction has greatly influenced my perspective on the art of Aikido and my life goals; I thank my training partners at Aikido of Madison as well. I would also like to thank the Two-Ticks volleyball group, Scooter Software, and the Jumptown Swing dance group. These many diversions preserved my sanity while working on this dissertation. My roommates John Ma and Emily Stone were also a good influence.

I also want to acknowledge the following funding sources: NLM 5T15LM007359, NSF 0534908, NSF MCB-0448369, and NIH 5-T32-GM08349. I also would like to thank Dr. Amy Harms and Dr. Greg Barrett-Wilt at the UW-Madison Biotechnology Center Mass Spectrometry Facility.

Finally, I would like to thank my parents, Mike McIlwain and Anita Litteer. I have had the unique advantage of having a chemist for a father and a computer scientist for a mother; the same goes for my step-parents, Jerry Litteer and Beth McIlwain. Beyond the continued support and love they have always provided, their career influence has guided me to a field that combines both their disciplines.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Disease Classification and Biomarker Discovery
  1.2 Extracting Quantitative Information
  1.3 Contributions
  1.4 Thesis Statement

2 Background
  2.1 Computational Biology and Machine Learning
    2.1.1 Classification Models
    2.1.2 Regression Models
    2.1.3 Confusion Matrix
    2.1.4 Receiver-Operating Characteristic (ROC) and Precision-Recall (PR) Curves
    2.1.5 Dynamic Programming
  2.2 Mass Spectrometry and its Applications in Biology
    2.2.1 Isotopic Distributions
    2.2.2 High Pressure Liquid Chromatography
    2.2.3 Tandem Mass Spectrometry
    2.2.4 Isotopic Labeling Experiments
  2.3 Conclusion

3 Using Dynamic Programming to Create Isotopic Distribution Maps from Mass Spectra
  3.1 Introduction
  3.2 Approach
    3.2.1 Naïve Bayes Model
    3.2.2 Determining Charge State
    3.2.3 Inter- and Intra-distance Features
    3.2.4 Shape Probabilities Feature
    3.2.5 Remaining Features and Overall Classification
  3.3 Methods
    3.3.1 Absolute Scores
    3.3.2 Coarse Scores
    3.3.3 Mono-Isotopic Scores
  3.4 Discussion
  3.5 Related Work
  3.6 Improvements
    3.6.1 Valid Isotopes
      3.6.1.1 Search Algorithm
      3.6.1.2 Training Examples
      3.6.1.3 Mono-Isotopic Fine Scores
    3.6.2 Regression
    3.6.3 Weighted F1-Score
    3.6.4 Experiments
    3.6.5 Full Peak List
    3.6.6 Results
    3.6.7 Discussion
  3.7 Future Work
  3.8 Conclusions
  3.9 Acknowledgments

4 Using Isotopic Distribution Maps to Generate Features for Disease Classification
  4.1 Introduction
    4.1.1 Prion Disease
  4.2 Approach
    4.2.1 Pre-processing the Mass-to-charge Peak Lists
    4.2.2 Isotopic Distribution Annotation with Interleaving Peaks
    4.2.3 Extracting Feature Vectors
    4.2.4 Pre-processing the Feature Vector Data
    4.2.5 Classification of Data
  4.3 Methodology
    4.3.1 Data Acquisition
    4.3.2 Analysis
    4.3.3 Results and Discussion
  4.4 Related Work
  4.5 Future Work
  4.6 Conclusions
  4.7 Acknowledgments

5 Matching Isotopic Distributions from Metabolically Labeled Samples
  5.1 Introduction
  5.2 Approach
    5.2.1 Naïve Bayes Model
    5.2.2 Isotopic Distribution Probabilities
    5.2.3 Number of Labels
    5.2.4 Shape Probabilities Features
    5.2.5 Mass-To-Charge and Mass Differences
    5.2.6 Overall Classification
  5.3 Methods
    5.3.1 Absolute Match Scores
    5.3.2 Mono-Isotopic Match Scores
    5.3.3 Mono-Isotopic Fine Match Scores
    5.3.4 Training and Validation
  5.4 Results and Discussion
  5.5 Followup
    5.5.1 Followup Discussion
  5.6 Related Work
  5.7 Future Work
  5.8 Conclusion
  5.9 Funding
  5.10 Acknowledgments
  5.11 Contributions

6 Obtaining Quantitative Information from Metabolically Labeled Isotopic Distribution Pairs
  6.1 Introduction
    6.1.1 Calculating the Quantitative Ratios
    6.1.2 Finding Differentially Expressed Peptides
  6.2 Methodology
    6.2.1 Calibration Curves
    6.2.2 Two-Fold Match Scores
  6.3 Materials and Methods
  6.4 Results
  6.5 Discussion
  6.6 Future Work
  6.7 Conclusions
  6.8 Acknowledgments

7 Future Work and Conclusions
  7.1 Future Work
    7.1.1 Isotopic Distribution Annotation Algorithm
    7.1.2 Disease Diagnosis and Biomarker Identification
    7.1.3 Isotopic Distribution Pairs Annotation Algorithm
    7.1.4 Extracting Quantitative Information from Isotopic Matched Pairs
  7.2 Conclusions

A Isotopic Distribution Annotation Improvement Figures

B Matching Isotopic Distribution Pairs Figures

List of Tables

2.1 Confusion Matrix
2.2 Dynamic Programming Table for Calculating Fib(5)
3.1 Statistical Results of Classifier and Dynamic Programming Algorithms (see Section 3.3 for the distinction between absolute and coarse scores)
3.2 Comparison of Mono-Isotopic Peak Finding
3.3 Time taken generating the S(i, j)'s. Results are from an AMD Turion 64 X2 Mobile Technology TL-50 (1.6 GHz) computer with 2 GB RAM.
3.4 Summary of improved-method F1-Scores. The improved method uses the "Valid Isotopes" search and examples, M5Prime regression, and training of the parameters using a weighted F1-Score on Mono-Isotopic Fine scores.
4.1 Interleaving Isotopic Distribution Statistics
4.2 Statistical Results of Dynamic Programming Algorithms with and without the interleaving algorithm (see Sections 3.3.3 and 3.6.1.3 for the distinction between the Mono-Isotopic and the Mono-Isotopic Fine scores)
4.3 Cross-Validation Accuracies for Prion Dataset
4.4 Permutation p-values for prion dataset
4.5 Top ten peak clusters scored by information gain for the interleave and w/o-interleave features
5.1 Number of states and equations for the number of look-backs used in the dynamic programming algorithm
5.2 Statistical results of baseline classification schemes using expert-annotated isotopes
5.3 Statistical results of classifier and dynamic programming algorithms using expert-annotated isotopes (see Section 5.3 for the distinction between the scores)
5.4 Statistical results of classifier and dynamic programming algorithms using expert-selected peaks (see Section 5.3 for the distinction between the scores)
5.5 Statistical results of classifier and dynamic programming algorithms using machine-selected peaks (see Section 5.3 for the distinction between the scores)
5.6 Statistical results of isotope annotation algorithm using expert- and machine-selected peaks (see Sections 3.3.3 and 3.6.1.3 for the distinction between the scores)
5.7 Statistical Results of Regressor and Dynamic Programming Algorithms Using Machine-Selected Peaks
5.8 Statistical Results of Followup Using Machine-Selected Peaks
6.1 Sequences and ratios used for this data set. Each set of sequences from one ratio is the result of one spectrum. For the modifications, NQ stands for deamidation (substitution of NH2 with OH for Asn and Gln) and CisCAM is a modification to a cysteine called carbamidomethylation (+C2H3ON).
6.2 Log Mean Squared Error Results for the perfect match annotations
6.3 Two-Fold Match Scores for the Experts method versus Dynamic Programming with mono-isotopic scores
6.4 Statistical Results of Dynamic Programming Algorithm for the 9-fold data (see Sections 3.3.3 and 3.6.1.3 for the distinction between the Mono-Isotopic and the Mono-Isotopic Fine scores)
A.1 Modification options for the algorithm. Valid = "Valid Isotopes".
B.1 Modification options for the isotopic match algorithm.

List of Figures

2.1 Example Receiver-Operating Characteristic (ROC) curve
2.2 Example Precision-Recall (PR) curve
2.3 Recursion Tree for Calculating Fib(5)
2.4 Example Mass Spectrum
2.5 Example Isotopic Distribution from a high-resolution mass spectrometer
2.6 Tandem Mass Spectrometry Illustration. The fragmentation pattern for each individual ion can be searched in a peptide database to discern the sequence of the original molecule.
2.7 An example spectrum section of an isotopically labeled "light" and "heavy" molecule
3.1 Example Peaks for use in an S-Matrix. Mass-to-charge ratio increases from left to right.
3.2 Example naïve Bayes Model for Estimating Isotopic Distribution Probabilities
3.3 Fitted Charge State Line
3.4 Curves and points using all features with expert peak-picked data (ROC curve)
3.5 Curves and points using all features with expert peak-picked data (PR curve)
3.6 Curves and points using all features with machine peak-picked data (ROC curve)
3.7 Curves and points using all features with machine peak-picked data (PR curve)
3.8 Curves without one feature (ROC curve)
3.9 Curves without one feature (PR curve)
3.10 Curves using one feature (ROC curve)
3.11 Curves using one feature (PR curve)
3.12 Mono-Isotopic F1 Scores for the improved method versus the original published work
3.13 Mono-Isotopic Precision Scores for the improved method versus the original published work
3.14 Mono-Isotopic Recall Scores for the improved method versus the original published work
4.1 Example of Two Interleaving Isotopic Distributions
4.2 Permutation Test of Undeisotoped Data Using All Features (P(X > 0.83) < 0.001)
4.3 Permutation Test of Undeisotoped Data Using Top 100 Info-Gain Features (P(X > 0.59) = 0.202)
4.4 Permutation Test of Undeisotoped Data Using Top 10 Info-Gain Features (P(X > 0.63) = 0.079)
4.5 Permutation Test of Deisotoped (w/o Interleave) Data Using All Features (P(X > 0.82) < 0.001)
4.6 Permutation Test of Deisotoped (w/o Interleave) Data Using Top 100 Info-Gain Features (P(X > 0.79) = 0.001)
4.7 Permutation Test of Deisotoped (w/o Interleave) Data Using Top 10 Info-Gain Features (P(X > 0.73) = 0.003)
4.8 Permutation Test of Deisotoped (w Interleave) Data Using All Features (P(X > 0.84) < 0.001)
4.9 Permutation Test of Deisotoped (w Interleave) Data Using Top 100 Info-Gain Features (P(X > 0.77) = 0.003)
4.10 Permutation Test of Deisotoped (w Interleave) Data Using Top 10 Info-Gain Features (P(X > 0.73) = 0.004)
4.11 ROC Curves of Undeisotoped Data Using All Features
4.12 ROC Curves of Undeisotoped Data Using Top 100 Info-Gain Features
4.13 ROC Curves of Undeisotoped Data Using Top 10 Info-Gain Features
4.14 ROC Curves of Deisotoped (w/o Interleave) Data Using All Features
4.15 ROC Curves of Deisotoped (w/o Interleave) Data Using Top 100 Info-Gain Features
4.16 ROC Curves of Deisotoped (w/o Interleave) Data Using Top 10 Info-Gain Features
4.17 ROC Curves of Deisotoped (w Interleave) Data Using All Features
4.18 ROC Curves of Deisotoped (w Interleave) Data Using Top 100 Info-Gain Features
4.19 ROC Curves of Deisotoped (w Interleave) Data Using Top 10 Info-Gain Features
4.20 ROC Curves of Data Using All Features and Classifying by Vote
4.21 ROC Curves of Data Using Top 100 Info-Gain Features and Classifying by Vote
4.22 ROC Curves of Data Using Top 10 Info-Gain Features and Classifying by Vote
5.1 Example naïve Bayes Model for Estimating Isotopic Match Distribution Probabilities
5.2 Example of an annotated mass spectrum. The distributions of the same color are the "light"-"heavy" matched pair.
5.3 Example of the 2-step algorithm requirement. The arrows indicate the "light"-"heavy" matched pair.
5.4 F1-Scores using expert-annotated isotopes. DP-LB-X stands for the dynamic programming algorithm with a lookback of X.
5.5 F1-Scores using expert-selected peaks
5.6 F1-Scores using machine-selected peaks
6.1 Overall task for quantitative labeling analysis. The arrows indicate the "light"-"heavy" matched pair and the numbers are the corresponding quantitative ratios.
A.1 Mono-Isotopic F1-Scores using the section data
A.2 Mono-Isotopic Fine F1-Scores using the section data
A.3 Mono-Isotopic F1-Scores of "Exhaustive" versus "Valid Isotopes" Search using the section data. X is either e-"Exhaustive" or v-"Valid Isotopes". No significant change in F1 is observed for most cases.
A.4 Mono-Isotopic Fine F1-Scores of "Exhaustive" versus "Valid Isotopes" Search using the section data. X is either e-"Exhaustive" or v-"Valid Isotopes". No significant change in F1 is observed for most cases.
A.5 Mono-Isotopic F1-Scores of "Near-Miss" versus "Valid Isotopes" Examples using the section data. X is either n-"Near-Miss" or v-"Valid Isotopes". Significant improvement in F1 for "Valid Isotopes" is observed for most cases.
A.6 Mono-Isotopic Fine F1-Scores of "Near-Miss" versus "Valid Isotopes" Examples using the section data. X is either n-"Near-Miss" or v-"Valid Isotopes". Significant improvement in F1 for "Valid Isotopes" is observed for most cases.
A.7 Mono-Isotopic F1-Scores of Mono-Isotopic versus Mono-Isotopic Fine training metric using the section data. X is either mi-Mono-Isotopic or mif-Mono-Isotopic Fine. No significant change in F1 is observed for most cases.
A.8 Mono-Isotopic Fine F1-Scores of Mono-Isotopic versus Mono-Isotopic Fine Training Metric using the section data. X is either mi-Mono-Isotopic or mif-Mono-Isotopic Fine. No significant change in F1 is observed for most cases.
A.9 Mono-Isotopic F1-Scores of Naïve Bayes versus M5Prime isotope scorer using the section data. X is either c-Classifier (Naïve Bayes) or r-Regressor (M5Prime). In cases where the exhaustive search is used with the "near-miss" examples, a significant decrease in F1 is observed.
A.10 Mono-Isotopic Fine F1-Scores of Naïve Bayes versus M5Prime using the section data. X is either c-Classifier (Naïve Bayes) or r-Regressor (M5Prime). In cases where the exhaustive search is used with the "near-miss" examples, a significant decrease in F1 is observed.
A.11 Mono-Isotopic F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. No significant change in F1 is observed for most cases.
A.12 Mono-Isotopic Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. Insignificant change is observed for most cases.
A.13 Mono-Isotopic Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. Insignificant change is observed for most cases.
A.14 Mono-Isotopic Fine F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. Insignificant change in F1 is observed for most cases.
A.15 Mono-Isotopic Fine Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. Insignificant change is observed for most cases.
A.16 Mono-Isotopic Fine Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted. An insignificant degradation of weighted relative to unweighted is observed for most cases.
A.17 Mono-Isotopic F1-Scores using the full data
A.18 Mono-Isotopic Fine F1-Scores using the full data
A.19 Mono-Isotopic F1-Scores of "Exhaustive" versus "Valid Isotopes" Search using the full data. X is either e-"Exhaustive" or v-"Valid Isotopes".
A.20 Mono-Isotopic Fine F1-Scores of "Exhaustive" versus "Valid Isotopes" Search using the full data. X is either e-"Exhaustive" or v-"Valid Isotopes".
A.21 Mono-Isotopic F1-Scores of "Near-Miss" versus "Valid Isotopes" Examples using the full data. X is either n-"Near-Miss" or v-"Valid Isotopes".
A.22 Mono-Isotopic Fine F1-Scores of "Near-Miss" versus "Valid Isotopes" Examples using the full data. X is either n-"Near-Miss" or v-"Valid Isotopes".
A.23 Mono-Isotopic F1-Scores of Mono-Isotopic versus Mono-Isotopic Fine Training Metric using the full data. X is either mi-Mono-Isotopic or mif-Mono-Isotopic Fine.
A.24 Mono-Isotopic Fine F1-Scores of Mono-Isotopic versus Mono-Isotopic Fine Training Metric using the full data. X is either mi-Mono-Isotopic or mif-Mono-Isotopic Fine.
A.25 Mono-Isotopic F1-Scores of Naïve Bayes versus M5Prime isotope scorer using the full data. X is either c-Classifier (Naïve Bayes) or r-Regressor (M5Prime).
A.26 Mono-Isotopic Fine F1-Scores of Naïve Bayes versus M5Prime using the full data. X is either c-Classifier (Naïve Bayes) or r-Regressor (M5Prime).
A.27 Mono-Isotopic F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
A.28 Mono-Isotopic Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
A.29 Mono-Isotopic Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
A.30 Mono-Isotopic Fine F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
A.31 Mono-Isotopic Fine Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
A.32 Mono-Isotopic Fine Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.1 Mono-Isotopic Match F1-Scores using the full data
B.2 Mono-Isotopic Fine Match F1-Scores using the full data
B.3 Mono-Isotopic Match F1-Scores of Classifier versus 3-Step Lookback Dynamic Programming Annotation approach. X is either c-Classifier or 3-3-Step Lookback Dynamic Programming.
B.4 Mono-Isotopic Fine Match F1-Scores of Classifier versus 3-Step Lookback Dynamic Programming Annotation approach. X is either c-Classifier or 3-3-Step Lookback Dynamic Programming.
B.5 Mono-Isotopic Match F1-Scores of using the isotope scorer as a set of features within the Match Score function. X is either y-Yes or n-No.
B.6 Mono-Isotopic Fine Match F1-Scores of using the isotope scorer as a set of features within the Match Score function. X is either y-Yes or n-No.
B.7 Mono-Isotopic Match F1-Scores of Classifier (Naïve Bayes) versus Regressor (M5Prime) for the Isotopic Scorer. X is either c-Classifier or r-Regressor.
B.8 Mono-Isotopic Fine Match F1-Scores of Classifier (Naïve Bayes) versus Regressor (M5Prime) for the Isotopic Scorer. X is either c-Classifier or r-Regressor.
B.9 Mono-Isotopic Match F1-Scores of Classifier (Naïve Bayes) versus Regressor (M5Prime) for the Isotopic Match Scorer. X is either c-Classifier or r-Regressor.
B.10 Mono-Isotopic Fine Match F1-Scores of Classifier (Naïve Bayes) versus Regressor (M5Prime) for the Isotopic Match Scorer. X is either c-Classifier or r-Regressor.
B.11 Mono-Isotopic Match F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.12 Mono-Isotopic Match Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.13 Mono-Isotopic Match Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.14 Mono-Isotopic Fine Match F1-Scores of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.15 Mono-Isotopic Fine Match Recall of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.
B.16 Mono-Isotopic Fine Match Precision of Weighted versus Unweighted Recall. X is either u-Unweighted or w-Weighted.

List of Algorithms

1 Algorithm for annotating isotopic distributions
2 Pseudo-code for training the annotation algorithm
3 Algorithm for annotating isotopic distributions with interleaving isotopes
4 The 1-step look back algorithm for annotating matched isotopic pairs
5 The 2-step look back algorithm for annotating matched isotopic pairs
6 The 1-step look back dynamic program for both isotopic distributions and matched isotopic pairs

Chapter 1

Introduction

The application of machine learning to problems in computational biology has been widely studied in a number of different areas. Machine learning has made contributions to genetic microarray analysis [4], protein structure and folding prediction [10, 47], and biomedical text analysis [14, 85]. Currently, there are chemical and biological issues in mass spectrometry where machine learning can be applied. Recent applications of mass spectrometry to the study of high-throughput proteomics, defined as the determination of an individual’s protein levels under certain conditions, have received much attention in the scientific community. Two applications of interest for mass spectrometry are finding biomarkers for disease and obtaining quantitative measurements for studies in systems biology.

1.1 Disease Classification and Biomarker Discovery

Analyzing proteomic data generated from mass spectrometry shows promise for predicting disease and finding biomarkers within bodily-fluid samples (serum, urine, etc.). By taking mass spectra of affected and unaffected samples from subjects, potential features can be extracted that separate the two groups using classification algorithms. One major challenge is generating predictive features that have statistical and biological relevance [2, 15]. A number of different approaches have been explored [19, 35, 52, 63, 64, 70, 77, 79, 91, 96].

There are many issues to consider when analyzing proteomic mass spectrometry data for classification. Detector saturation reduces the predictive power of the peak intensities of the sample's mass spectrum. There are mass-to-charge (m/z) shifts of common peaks between spectra. Within the spectra, there are noise peaks that are not indicative of the underlying biology; these peaks are caused by chemical and electronic noise during the mass spectrometry measurement. Also, there are redundant features due to isotopic distributions, adducts, multiple charge states, and peptide fragments arising from proteolysis.

Repeatability is a fundamental problem for classification using mass spectra. Complex interactions of the measured sample's constituents affect the heights of the spectra. Electronic noise or mechanical drift of the mass-to-charge values occurs intermittently between scans made on different days or in different experiments. In mass spectrometry where ionization occurs in the solid state, heterogeneous samples give rise to different measurements depending upon which part of the sample is measured. Much care and preprocessing must be performed to obtain accurate and repeatable results from mass spectrometry.

1.2 Extracting Quantitative Information

Another application of mass spectrometry is the use of isotopic labeling to obtain relative quantitative measurements of samples under different conditions. Ordinarily, obtaining quantitative measurements from mass spectrometry is not possible because peak heights are not calibrated across spectra. By labeling two samples from the same species with a chemical label that has the same atomic formula but a different isotopic make-up, relative quantifications are made by measuring the ratios of the isotopic envelopes of the "light" and "heavy" labeled molecules. Examples of these are the ICAT and iTRAQ methods [1, 6, 11, 21, 25, 30–33, 50, 65, 68, 73, 74, 81, 82]. Another type of label has a heavy isotope, such as 15N, introduced into an organism's diet long before sample collection; this approach is called metabolic labeling.

A major challenge of a metabolic labeling experiment is finding the interesting matched isotopic envelopes in a fast and accurate way. For this dissertation, "interesting matched isotopes" are defined as a pair of isotopic distributions whose relative difference is a factor of two or more. Currently, analyzing these data involves processing a large number of spectra (on the order of 10,000). A high-throughput method is desired to automatically find these matched pairs in mass spectra and calculate the corresponding quantitative results.
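To make the two-fold criterion concrete, the following is a minimal sketch (not the dissertation's code; the function names and the toy intensities are illustrative assumptions) of computing a light/heavy envelope ratio and flagging an "interesting" pair:

import math

# Sum each envelope's peak intensities and take the light/heavy ratio.
def relative_ratio(light_intensities, heavy_intensities):
    return sum(light_intensities) / sum(heavy_intensities)

# A pair is "interesting" when it differs by a factor of `fold`
# or more in either direction (symmetric on a log scale).
def is_interesting(ratio, fold=2.0):
    return abs(math.log2(ratio)) >= math.log2(fold)

r = relative_ratio([120.0, 90.0, 40.0], [55.0, 42.0, 18.0])
print(r, is_interesting(r))  # ~2.17, True

Working on the log scale makes a two-fold increase and a two-fold decrease symmetric, which is why the check above uses |log2(ratio)| rather than the raw ratio.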


1.3 Contributions

This dissertation contributes to both machine learning and computational biology. For machine learning, first, the dissertation shows that replacing the labels of "near-miss" negative examples with a measure of closeness to the true answer can significantly improve annotation accuracy. Second, somewhat related to the study of ensembles in machine learning, the dissertation shows that multiple measurements of an example may improve both the accuracy and reliability of the results; this is demonstrated in the context of finding biomarkers in mass spectrometry. For computational biology, this dissertation presents methods of annotating isotopic distributions and matched isotopic distribution pairs in mass spectra. These two methods are then adapted to assist with diagnosis, biomarker discovery, and quantitative measurements.

This dissertation is organized as follows. Chapter 2 gives background information and terminology for computational biology, machine learning, and mass spectrometry. Chapter 3 describes an algorithm for annotating isotopic distributions within a peak list derived from a mass spectrum. This algorithm utilizes machine learning coupled with dynamic programming, analogous to the approach of Craven et al. (2000) to a very different problem: predicting operons within a DNA sequence from the E. coli K-12 genome [17]. Chapter 4 demonstrates an application of this isotopic distribution annotation algorithm to disease diagnosis and biomarker discovery, where more biologically relevant and robust features are required. Chapters 5 and 6 deal with the quantitative part of this dissertation. Chapter 5 presents an algorithm that takes as input the isotopic distribution annotations from the algorithm described in Chapter 3 and matches the isotopically labeled pairs; again, this algorithm employs the machine learning plus dynamic programming approach. Chapter 6 compares and contrasts methods for obtaining quantitative information from these matched isotopic pairs and gives an overall view of the accuracies for (a) finding the correct matched isotopic pairs, and (b) obtaining reliable quantitative measurements from these correctly matched pairs. Chapter 6 also proposes an adaptation of the algorithm that may dramatically reduce the amount of mass spectrometry data that must be collected and analyzed in high-throughput proteomic experiments.

1.4 Thesis Statement

Algorithms that couple dynamic programming with machine learning are presented and evaluated on the task of annotation for mass spectrometry. The thesis is that annotating mass spectra via this coupling gives accurate qualitative and quantitative results for high-resolution and high-throughput mass spectrometry experiments.


Chapter 2

Background

Given the multidisciplinary nature of this dissertation, background information is provided for computational biology, machine learning, and mass spectrometry. These fields are quite broad, so only the pieces that overlap with this dissertation are discussed.

2.1 Computational Biology and Machine Learning

Computational biology is the application of computer science and statistics to problems that are prevalent in the study of biology. Example applications include microarray analysis and DNA sequence alignment. Often, computational biology involves applying a statistical modeling algorithm to a set of biological data.

Machine learning provides a series of techniques and modeling algorithms taken from the artificial intelligence domain and applied to problems of predicting trends or classes. These statistical models attempt to "learn" the underlying concept from data rather than fit a provided model. These modeling techniques take as input a set of examples, each described by a feature vector. Using this feature vector, the learned model determines an output value, which is indicative of the desired concept. Example tasks include classification and regression.

2.1.1 Classification Models

Classification models try to learn a discrete output from the supplied feature vector. A concept is generally binary-valued (yes or no), but can also have more than two classes. There are many algorithms that attempt to learn the target class; examples are decision trees, Bayesian networks, and support vector machines [12, 42, 55, 60]. Some classifiers can also associate a probability with the output values. Thresholding upon this probability allows for the adjustment of the classifier's performance statistics.

2.1.2 Regression Models

Regression models operate much the same as classification models, with the exception that the output variable is now real-valued. Example algorithms are linear regression, support vector machines, neural networks, and regression trees [55]. Regression trees, such as M5Prime, associate a value with each input feature vector using linear regression coupled with a decision tree [90].

2.1.3 Confusion Matrix

To estimate the predictive ability of classification models, n-fold cross-validation is performed. The data set is partitioned into n random subsets. For one iteration, n − 1 subsets are combined into a training set for building the model. After the model is built from the training set, the performance of the model's predictions on the held-out samples, called the test set, is calculated. This process is repeated n times, holding out a different set of samples each time, to build an average confusion matrix upon the test example prediction results. A confusion matrix (Table 2.1) has four measurements as follows:

• True Positive (TP) - Example is actually positive and correctly predicted as positive
• True Negative (TN) - Example is actually negative and correctly predicted as negative
• False Positive (FP) - Example is actually negative and incorrectly predicted as positive
• False Negative (FN) - Example is actually positive and incorrectly predicted as negative

From this confusion matrix, five performance measures are defined. Accuracy is the overall predictive performance of the classification model. Recall, or True Positive Rate, is the fraction of actual positives that are correctly predicted as positive. Precision is the fraction of predicted positives that are actually positive. False Positive Rate is the fraction of actual negatives that are incorrectly predicted as positive. F1-Score, or simply F1, is the harmonic mean of precision and recall; this score is commonly used when the number of negative examples is much greater than the number of positive examples. The metrics, as defined below, are used for the majority of this dissertation.

Accuracy = (TP + TN) / (TP + TN + FP + FN)                (2.1)

Recall = TP / (TP + FN)                                   (2.2)

Precision = TP / (TP + FP)                                (2.3)

FPR = FP / (FP + TN)                                      (2.4)

F1 = (2 × Precision × Recall) / (Precision + Recall)      (2.5)

                            Actual
                       True                   False
Predicted   True    True Positive (TP)     False Positive (FP)
            False   False Negative (FN)    True Negative (TN)

Table 2.1: Confusion Matrix
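As a concrete illustration, here is a minimal sketch (not from the dissertation; the function name and example counts are assumptions) that computes the five metrics of Equations 2.1–2.5 from raw confusion-matrix counts:

# Compute the metrics defined in Equations 2.1-2.5 from the four
# confusion-matrix counts. Assumes none of the denominators is zero.
def confusion_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # a.k.a. True Positive Rate
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)           # False Positive Rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, fpr, f1

# Example: 40 TP, 50 TN, 5 FP, 5 FN
print(confusion_metrics(40, 50, 5, 5))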

2.1.4 Receiver-Operating Characteristic (ROC) and Precision-Recall (PR) Curves

Another way of measuring performance is through the use of ROC curves. The classification scheme outputs a partial ordering on its predictions; we obtain a partial ordering from a probabilistic classifier by using the positive probability of each example. By thresholding on this probability, a series of points for the True Positive Rate (Recall) and False Positive Rate is obtained. Plotting these points generates an ROC curve, an example of which is Figure 2.1. This curve describes the tradeoff between the True Positive Rate and the False Positive Rate of the method [62]. Similarly, Precision-Recall curves are also studied. These curves, like ROC curves, measure a tradeoff, in this case between precision and recall (see Figure 2.2). One advantage of PR curves is that they are more robust when the number of positive versus negative examples is unbalanced [18].
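The thresholding procedure can be sketched as follows (a minimal sketch, not the dissertation's code; the function name and the toy probabilities are illustrative assumptions):

# Sweep a threshold over the predicted positive probabilities and
# record one (FPR, TPR) point per threshold; plotting these points
# gives an ROC curve like Figure 2.1.
def roc_points(probs, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(probs), reverse=True):
        predicted = [p >= threshold for p in probs]
        tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
        fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

print(roc_points([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0]))

Swapping the recorded pair to (recall, precision) at each threshold yields the points of a PR curve instead.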

2.1.5 Dynamic Programming

Dynamic programming is a technique which builds upon optimally solved subproblems to calculate an optimal answer [84]. A well-known application of dynamic programming is the calculation of Fibonacci numbers:


Figure 2.1: Example Receiver–Operating Characteristic (ROC) curve

Figure 2.2: Example Precision–Recall (PR) curve

Fib(n) = 0 if n = 0;  1 if n = 1;  Fib(n − 1) + Fib(n − 2) if n > 1.    (2.6)

n    Fib(n)             Answer
0    0                  0
1    1                  1
2    Fib(1) + Fib(0)    1
3    Fib(2) + Fib(1)    2
4    Fib(3) + Fib(2)    3
5    Fib(4) + Fib(3)    5

Table 2.2: Dynamic Programming Table for Calculating Fib(5)

Figure 2.3: Recursion Tree for Calculating Fib(5)

This equation is recursive and exponential in run-time, as it involves recalculating the same sub-equations repeatedly (see Figure 2.3). By utilizing dynamic programming, the numbers are calculated bottom-up, employing the previously stored Fib answers to calculate the next (see Table 2.2). The computational time is reduced to O(n), i.e., linear in n. This demonstrates the power of utilizing dynamic programming to optimize algorithms that are recursive. There have been many applications of dynamic programming in biology, such as sequence alignment and time-warping [8, 46, 56]. Dynamic programming is a valuable tool in a computational biologist's arsenal.
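The bottom-up computation of Table 2.2 can be written directly (a minimal sketch, not from the dissertation):

# Bottom-up dynamic programming for Fibonacci: each Fib(i) is computed
# once from stored subproblem answers, giving O(n) time instead of the
# exponential recursion of Figure 2.3.
def fib(n: int) -> int:
    table = [0, 1]                 # Fib(0), Fib(1)
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

print(fib(5))  # 5, matching the last row of Table 2.2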


Figure 2.4: Example Mass Spectrum (x-axis: Mass-To-Charge (m/z))

2.2 Mass Spectrometry and its Applications in Biology

Mass spectrometry involves measuring the masses of the different components of a sample under study. The first step of any mass spectrometry experiment is ionizing the sample, i.e., placing a positive or negative charge upon it. There are many different methods for ionizing a sample [75]. Once a sample is ionized, it is placed in a mass detection system. Again, many different instruments are employed to obtain accurate mass measurements; examples are Time-of-Flight (TOF, [16]) and Fourier-Transform (FTMS, [29]) instruments. Due to the ionization and the nature of the detection units, the x-value peak measurements are given as mass-to-charge ratios, whereas the y-value is in arbitrary units, usually referred to as intensity or abundance (see Figure 2.4). Most mass spectrometers give a measure of their resolving power in parts-per-million, which relates the width of the detected peaks to the mass-to-charge value. Within mass spectra there are many structural features; one of interest to this dissertation is the notion of an isotopic distribution.

2.2.1 Isotopic Distributions

To define isotopic distributions, we first need to describe atomic isotopes. Atomic isotopes are atoms with the same number of protons but different neutron content [7].

Figure 2.5: Example Isotopic Distribution from a high-resolution mass spectrometer (multiple peaks due to the relative abundance of C, N, O, and S atomic isotopes from the same compound; x-axis: Mass-To-Charge (m/z))

Isotopic distributions are multiple mass peaks arising within mass spectra that come from compounds of the same molecular formula but differ in their atomic isotope composition. High-resolution mass spectrometers are able to detect isotopic distributions; an example of this phenomenon is shown in Figure 2.5. These peaks are grouped together with a mass-to-charge spacing of ≈ 1/C, where C is the number of charges associated with the molecules during the ionization process.

2.2.2
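As a concrete illustration of the ≈ 1/C spacing, the charge state can be guessed from the observed peak spacing. This is a minimal sketch, not the dissertation's charge-state method (that is described in Section 3.2.2); the candidate charge range and example peaks are assumptions:

# Guess the charge state C from the ~1/C spacing between consecutive
# peaks of an isotopic distribution.
def charge_from_spacing(mz_peaks, max_charge=4):
    spacings = sorted(b - a for a, b in zip(mz_peaks, mz_peaks[1:]))
    median = spacings[len(spacings) // 2]
    # Pick the charge whose expected spacing (1/C) is closest.
    return min(range(1, max_charge + 1),
               key=lambda c: abs(1.0 / c - median))

# Peaks spaced ~0.5 m/z apart suggest a doubly charged ion.
print(charge_from_spacing([665.0, 665.5, 666.0, 666.5]))  # 2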

High Pressure Liquid Chromatography

Typically, only the most abundant proteins are detected in a mass spectrum. To see the less abundant proteins, it is common to "fractionate" a sample. A frequently used fractionation technique is column chromatography [78], a system that separates a sample's molecular constituents by differences in their chemical properties, e.g., polarity or size. High pressure liquid chromatography (HPLC) is an example column chromatography system that uses a timed mixture gradient of liquid passed through a packed column containing chromatographic material [37, 92]. Due to the nature of the gradient and the chemical properties of the column's material, the sample's molecules pass through the system at different rates. Using HPLC with mass spectrometry reduces the complexity of the sample for each time slice measured, but also increases the amount of mass spectra to be collected and analyzed for a single sample.

Figure 2.6: Tandem Mass Spectrometry Illustration (sample → ionize/mass spec → parent spectrum → isolate ion → fragment → ionize/mass spec → mass fingerprint spectrum). The fragmentation pattern for each individual ion can be searched in a peptide database to discern the sequence of the original molecule.

2.2.3

Tandem Mass Spectrometry

A common technique for assigning peptide sequences to mass peaks found in a spectrum is fingerprinting with tandem mass spectrometry, also called MS-MS. Tandem mass spectrometry involves multiple mass spectrometry scans with a fragmentation step. In the first scan of a sample, called the parent spectrum, peaks are selected to be identified. Each peak's causal ions are then isolated and fragmented through a well-studied process. A mass spectrum is then collected from the fragments to generate a "fingerprint" of the original molecular ion; Figure 2.6 illustrates this process. By searching a database with the fragmented spectrum and the original ion mass, the peptide sequence is inferred. Database searching tools, such as SEQUEST [23], MS-Fit [13], and PeptideProphet [45], are used for this step of the process.

2.2.4 Isotopic Labeling Experiments

An especially important application of mass spectrometry is the use of a mixture of two (or more) samples that are differentially labeled by chemicals that have the same molecular structure but differ in their atomic isotope composition. The mass spectrum now has every isotopic distribution corresponding to a given molecule from the ordinary sample, and should also have a matching distribution downstream in mass-to-charge for the isotopically labeled sample (see Figure 2.7). Using the ratio of these matching isotopic distributions, relative quantitative information is inferred for the two samples' peptide content. Example experiments of this nature are Isotope-Coded Affinity Tags (ICAT), isobaric tags for relative and absolute quantitation (iTRAQ(TM))(1), and others [1, 11, 21, 25, 30–33, 65, 68, 73, 74, 81, 82]. iTRAQ(TM) and ICAT involve reagents that chemically bind to the molecule. Metabolic labeling, however, involves replacing existing atoms with a heavy isotope through an organism's diet (e.g., replacing all 14N's with 15N's). An important difference between metabolic labeling and the iTRAQ and ICAT systems is that the spacing between the paired isotopic distributions depends upon the number of labeled atoms (e.g., the number of nitrogens).

2.3 Conclusion

These are some of the general concepts and techniques used throughout the rest of this dissertation. More specific information is given within each of the following chapters as needed. The next chapter describes the approach taken to annotate a mass peak list into the underlying sample's isotopic distributions.

¹ iTRAQ is a trademark of the Applera Corporation or its subsidiaries in the U.S. and/or certain other countries.


Figure 2.7: An example spectrum section of an isotopically labeled “light” and “heavy” molecule


Chapter 3

Using Dynamic Programming to Create Isotopic Distribution Maps from Mass Spectra

3.1 Introduction

When analyzing high-resolution mass spectrometry data for qualitative and quantitative experiments, the first step is to annotate each spectrum into its isotopic distributions. Isotopic distributions are collections of peaks that arise from the same molecular compound but contain different compositions of atomic isotopes. These distributions give us multiple peaks, and hence multiple evidence sources, for specific peptides [13, 20, 23, 26, 45] or other molecules. Isotopic distributions can be particularly helpful in metabolomics experiments or with organisms whose proteomes have not been predicted. These isotopic annotations are also helpful for quantitative metabolic labeling experiments [5, 40, 49, 58, 88, 94]. While some current algorithms for isotope analysis are concerned with deconvolving spectra into their mono-isotopic peaks, entire isotopic distributions would be useful for a number of applications. One such application is quantitative proteomics experiments involving isotopic labeling, where the ratios of the peak heights or areas are used for quantitative measurements. Also, using isotopic distributions as features could prove more robust and biologically significant than single-peak features. Experiments such as these would benefit from a map of isotopes within spectra rather than a mono-isotopic deconvolved counterpart.

We describe a method to annotate the isotopes within a mass spectrum. To accomplish this, we propose an algorithm analogous to an approach by Craven et al. (2000) to a very different problem: predicting operons within a DNA sequence from the E. coli K-12 genome [17]. Their algorithm employs dynamic programming, building upon a naïve Bayes model that predicts the probability of an operon given the data [55]. Using expert-constructed peak lists from the spectra, we show that the dynamic programming map algorithm achieves a dramatically superior true positive/false positive rate when compared to the classifier used to score isotopic distributions. We also extend the algorithm to handle the many noise peaks present when using machine-constructed peak lists rather than expert-constructed peak lists, and again the dynamic programming approach outperforms the classifier. By machine-constructed peak lists we mean lists generated by the mass spectrometer's peak-picking software, whereas expert-constructed peak lists are those from which the expert has removed the noise peaks. While at first the performance in the machine-constructed case is quite low, relaxing the requirement from "entirely" correct isotopic annotations to "mostly" correct yields a substantial increase in performance. This relaxation comes from two observations: the annotations provided by the expert may or may not be the ground truth of the correct isotopic distribution annotation, and the cost of identifying a distribution containing most of the peaks, including the mono-isotopic peak, rather than every peak, may be small depending upon the application. Modifications to the algorithm are made to annotate the isotopic distributions more correctly while reducing the run-time. One change generates training examples that better represent the underlying concept of an isotopic distribution. Another replaces the exhaustive search over candidate isotopic distributions with a heuristic that incorporates properties of a correct isotopic distribution. The following is the initial approach, which was published in Bioinformatics 2007 with co-authors David Page, Edward Huttlin, and Mike Sussman. This work was also presented at the Intelligent Systems for Molecular Biology (ISMB) conference held in Vienna, Austria in 2007 [53]. Section 3.6 discusses improvements on that published approach.


3.2 Approach

Using probabilities from features of distributions such as length, shape, inter-distribution distances, and intra-distribution distances, we can construct a naïve Bayes model, illustrated in Figure 3.2, to estimate the probability that a proposed distribution of peaks is an isotopic distribution. By "an isotopic distribution" we mean both that (1) every peak in the distribution arises from the same molecular compound with different combinations of isotopes, and (2) no other peaks in the spectrum arise from the same molecular compound, with the exception of those derived from different charge states (multiple charges can be assigned to a population of the same molecule). We can estimate the parameters (probabilities) of the naïve Bayes model using either the literature or training data, i.e. some annotated spectra. In our work, we choose to estimate the parameters from hand-annotated spectra.

Given a probability for each potential isotopic distribution (each run of consecutive peaks), we would like to map all the peaks of the spectrum into their isotopic distributions. We take the score of any peak in such a map to be the log (base 2) probability of the distribution to which it is mapped, and we take the score of a map to be the sum of the peak scores.¹ We now describe a dynamic programming approach to find the optimal map with respect to this scoring function. In the dynamic programming approach, we use the distribution probabilities from the Bayes net to calculate S(i, j), the log probability that the peak distribution ending at position (peak number) i and having length j is a true isotopic distribution (Figure 3.1 illustrates the assignment of S(i, j)'s to peaks). M(i) denotes the optimal distribution map, or sequence of isotopic distributions, beginning with the first peak and ending at peak position i. For successive values of i, we consider extending the current distribution or starting a new one. Algorithm 1 describes this process.

¹ We sum the log probabilities by peak rather than by distribution to avoid artificially raising the scores of maps with very long distributions. For a spectrum of length n the score of any map is now effectively the product of n values, regardless of the number of distributions used in the map. If we summed one score per distribution instead, a map placing all n peaks of a spectrum in a single distribution might score better than other maps simply because it is the sum of a single negative value (log probability) rather than the sum of many negative values. The same scoring approach was used by Craven et al. (2000) [17] for the task of finding operons in E. coli.


Figure 3.1: Example Peaks for use in an S-Matrix. Mass-to-charge ratio increases from left to right.

Data: Peak List
Result: Isotopic Distribution Annotations
Generate S(i, j);
Generate M(i) using the equations:

    M(0) = 0                                            (3.1)

    M(i) = max { 1 × S(i, 1) + M(i − 1),
                 2 × S(i, 2) + M(i − 2),
                 ...,
                 i × S(i, i) + M(i − i) }               (3.2)

Recover the path taken to determine the isotopic distribution annotations;

Algorithm 1: Algorithm for Annotating Isotopic Distributions
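To make the recurrence concrete, here is a minimal Python sketch of Algorithm 1; the function name and signature are illustrative assumptions, with S(i, j) assumed to return the log (base 2) probability of the candidate distribution.

import math

def annotate_map(n, S, max_len=15):
    # M[i]: best map score over peaks 1..i; best_len[i]: length chosen at i.
    M = [0.0] + [-math.inf] * n
    best_len = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(1, min(i, max_len) + 1):
            # Each of the j peaks contributes the distribution's log
            # probability, hence the factor j (scoring by peak, not by
            # distribution).
            score = j * S(i, j) + M[i - j]
            if score > M[i]:
                M[i], best_len[i] = score, j
    # Recover the path as a list of (first_peak, last_peak) distributions.
    dists, i = [], n
    while i > 0:
        j = best_len[i]
        dists.append((i - j + 1, i))
        i -= j
    return list(reversed(dists))

The max_len bound corresponds to the practical cutoff on distribution length discussed below.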


Figure 3.2: Example naïve Bayes Model for Estimating Isotopic Distribution Probabilities (nodes: Isotopic Distribution; Number of Peaks; Shape; Charge State; Intra-Distribution Distances; Inter-Distribution Distances).

We also store the length of the current distribution at every M(i) in order to recover the optimal map. The proof of the following theorem is analogous to the correctness proof for the operon-finding algorithm [17].

Theorem 1. Given any spectrum of n peaks, the algorithm returns an optimal (maximum-scoring) map of that spectrum into isotopic distributions.

The run-time of the algorithm is quadratic in the number of peaks, n, in the spectrum. Nevertheless, the run time can be further reduced in practice because it is unnecessary to consider possible isotopic distributions of more than, say, twelve or fifteen peaks. Maintaining the validity of the theorem while making this reduction requires the assumption that an optimal map contains no distributions longer than this bound; in practice, distributions of such length are given such low probabilities by the model that they are unlikely to be returned anyway.

Within a mass spectrum there are many peaks that occur due to chemical and electrical noise, and these spurious peaks can occur within an isotopic distribution from a real peptide. To handle noise peaks, we modify the score matrix S(i, j): it now contains the best distribution between peak i − j and peak i, inclusive, where this distribution can be any subset of the peaks between i − j and i, and a penalty is introduced for each excluded peak. We then run dynamic programming as before, using the same formula as above. At every M(i), we must also store the optimal peak distribution that starts at position i − j and ends at i; we can then build the map using these optimal distributions at the M(i)'s. To prevent exponential explosion, we place a bound on j. Assuming this bound is accurate, the revised dynamic programming algorithm maintains optimality. Due to the number of possible noise peaks within a spectrum, we also allow the algorithm to label an entire run of peaks as noise: if the maximum probability of the optimal distribution within a run of peaks is less than a certain threshold, called the noise threshold, the algorithm labels these peaks as a run of noise peaks, having probability of the noise threshold plus 0.1.

3.2.1 Naïve Bayes Model

A naïve Bayes model is employed to calculate the isotopic distribution probabilities. The features employed by the model encode the charge state, the distances between peaks within a potential distribution, the distance between the first peak of the potential distribution and the last peak before it, the distance between the last peak of the potential distribution and the next peak after it, the shape of (sequence of peak heights in) the potential distribution, the number of peaks in the potential distribution, and the relative ratio of intensities of the two highest peaks in the distribution (repeated also for the two lowest peaks). The remainder of this section describes these features in more detail. The naïve Bayes model assumes these features are independent of one another given the class (true isotopic distribution or not an isotopic distribution). Even when this assumption is violated, naïve Bayes models often work better in practice than more complicated Bayesian models, because the conditional independence assumption means the model estimates fewer parameters from the data, often resulting in better parameter estimates.

3.2.2 Determining Charge State

To determine the charge state, we calculate the best-fitting line on a plot of peak number (increasing from 1 to n peaks in the proposed distribution) vs. the m/z value for that peak. We require the slope of the line to be 1.0028/Z, where Z = 1, 2, 3, ..., N, and we require the line to pass through the first peak; the best-fitting Z is the charge state for the distribution. The reason for calculating the charge state in this way is the following. The value 1.0028 is approximately the mass of a neutron, and Z is the charge of the molecule. Owing to the nature of the measurement made by the mass spectrometer, the mass peak values are divided by the number of charges on the molecule (hence the m/z ratios for the x-axis of mass spectra). We can calculate probabilities of the various charge states by counting their frequencies within the training peak spectra. For our data, we use a maximum allowable charge state of 3. It is worth noting that for high-mass molecules, other methods for determining charge state, such as the Patterson or Fourier methods, may be more robust [71]; incorporating these methods to improve our results is listed as future work.

3.2.3 Inter- and Intra-Distance Features

The intra-distance feature captures the degree to which the distances between peaks within a potential isotopic distribution look like the distances we would expect in an isotopic distribution. The inter-distance feature represents the degree to which the distances at the borders of the potential distribution appear appropriate; specifically, the distance between the last peak before (first peak after) the distribution and the first peak (last peak) within the distribution should not be similar to distances we would expect within a true isotopic distribution. A complication is that the distances we expect within a true isotopic distribution vary with charge state. We therefore use the fitted line from the previous subsection (charge state feature) to determine the actual inter-distribution (inter) and intra-distribution (intra) distance features, through a method described in the remainder of this section. This determination involves estimating the error from the line (see Figure 3.3). We explain the intra-distance feature calculation first and follow with the inter-distance feature.

For a hypothetical isotopic distribution, we calculate the maximum squared error of the mass-to-charge values compared to the expected mass-to-charge values given the charge state as determined above. This error can be thought of as the maximum deviation of a distance between two consecutive peaks in the distribution from the distance we would expect in an ideal isotopic distribution, given the charge state. The computed squared error is then used as a feature for our model, called the intra-distance feature. For the inter-distance feature, we take the charge state fit from the proposed isotopic distribution and calculate the squared error of the best-fit peak "outside" of this distribution. This error is the minimum of the distances between either the last peak before the distribution and the first peak inside it, or the last peak inside the distribution and the first peak after it. We also refer to the intra-distance and inter-distance features as Inner/Outer Error.

Figure 3.3: Fitted Charge State Line
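As an illustration, here is a small sketch of the charge-state fit together with the resulting intra-distance (Inner Error) value; the helper name and loop structure are our assumptions, with the line forced through the first peak as described above.

def fit_charge_state(mz, max_charge=3, neutron=1.0028):
    # mz: m/z values of consecutive peaks in a proposed distribution.
    # For each candidate charge Z the fitted line has slope neutron/Z and
    # passes through the first peak; the best Z minimizes the maximum
    # squared deviation from the expected peak positions.
    best_z, best_err = None, float("inf")
    for z in range(1, max_charge + 1):
        errs = [(mz[k] - (mz[0] + k * neutron / z)) ** 2
                for k in range(len(mz))]
        err = max(errs)  # doubles as the intra-distance (Inner Error) feature
        if err < best_err:
            best_z, best_err = z, err
    return best_z, best_err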

3.2.4 Shape Probabilities Feature

By shape, we mean the relative intensity pattern of the peak distribution. The ratio patterns of the isotope peaks depend upon the molecular composition. To incorporate this, we derive a shape probability variable P(Shape) ∈ (0 . . . 1) that estimates whether the relative intensities of the distribution follow the patterns seen previously within the training dataset. For estimating this shape probability, we use an array of weighted nearest neighbor classifiers trained on the ratios of the peaks relative to the highest peak within the proposed isotopic distribution. Each weighted nearest neighbor classifier is trained using a specific number of ratios (Classifier 3 has three ratios, Classifier 4 has four ratios, etc.), and the number of ratios that a particular distribution has determines which classifier to use for the prediction (see the sketch below). For training, we generate a negative example from each positive example by randomly permuting the peak ratio order. From this classifier we get a value in the range (0 . . . 1) indicating whether the ratios fit a positive isotopic distribution; for distributions of one or two peak ratios, we return a value of 0.5 to signify undecided. We use this value as a feature for the overall classifier. It is important to note that these classifiers are re-trained on every fold of cross-validation (discussed later), using only the training data for that fold, to avoid an overly optimistic estimate of the method's performance.

We note that another possible shape feature would calculate exact isotopic distributions using a method such as Mercury [67], which employs "averagine" molecular compounds close to the mass of the observed isotopic distribution [66, 72]. This method is utilized in THRASH [36]; testing whether including it as a feature in the classifier improves upon our current results is listed as future work.
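The sketch referenced above might look as follows; the helper names, the k = 5 neighbor count, and the inverse-distance weighting are illustrative assumptions rather than the dissertation's exact classifier.

import numpy as np

def shape_probability(ratios, train_by_len, k=5):
    # ratios: peak intensities divided by the distribution's highest
    # intensity. train_by_len maps a ratio count to (X, y) training pairs,
    # with y = 1 for true shapes and y = 0 for order-permuted negatives.
    if len(ratios) <= 2:
        return 0.5                      # too little shape information: undecided
    X, y = train_by_len[len(ratios)]    # classifier selected by ratio count
    X, y, r = np.asarray(X), np.asarray(y), np.asarray(ratios)
    d = np.linalg.norm(X - r, axis=1)   # distance to each training shape
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)           # closer neighbors get larger weights
    return float(np.sum(w * y[idx]) / np.sum(w))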

3.2.5 Remaining Features and Overall Classification

Additional features include the number of peaks in the distribution and the ratios of the most intense peak to the mean and median of the entire peak list. We also include mean and median intensity ratios for the second-highest peak and the least intense peak. Another feature is the ratio of the highest peak intensity to the number of peaks in the distribution (Max Npeaks Ratio). We build a naïve Bayes classifier using these features. Our naïve Bayes classifier takes both discrete and continuous features, where continuous features are assumed to follow a Gaussian distribution. Because not all continuous features are Gaussian, we also consider binning the continuous features such that each bin contains an equal number of data points (see the sketch below). The number of bins is tuned to maximize the area under the precision-recall curve of the feature. Tuning is repeated on every fold of cross-validation, to avoid over-optimistic estimates of performance, and is performed by an inner loop of 10-fold cross-validation. If the area under the precision-recall curve (APR) for a Gaussian feature using the same cross-validation fold set is better than that of the binned feature, we use the Gaussian feature instead [18]. We use counts for the discrete features.
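The equal-frequency binning mentioned above can be sketched as follows, assuming NumPy; the helper names are hypothetical.

import numpy as np

def equal_frequency_cuts(values, n_bins):
    # Interior cut points chosen so each bin holds roughly the same
    # number of training values (equal-frequency, not equal-width, bins).
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, qs)

def bin_index(x, cuts):
    # Index of the bin x falls into, given the sorted cut points.
    return int(np.searchsorted(cuts, x))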

3.3 Methods

Mass spectra sections were obtained from labeled plant data in which Arabidopsis was grown in liquid culture with natural abundance or 15N-labeled MS salts. Labeled and unlabeled samples were then combined, fractionated, and analyzed via LC-MS on a Micromass QTOF-II mass spectrometer [57]. A random selection of isotopic distributions was taken from the LC-MS analysis of a single digested protein fraction. Isotopically labeled pairs were selected by visual inspection for their clarity, and an effort was made to include spectra representative of 1+, 2+ and 3+ charge states at a range of m/z values. This resulted in 99 sections of peak-picked mass spectra containing 214 distributions hand-annotated by a single expert (Edward Huttlin, Department of Biochemistry, University of Wisconsin, Madison). Note that some sections came from the same spectrum, but these do not overlap in their mass-to-charge values.

For each isotopic distribution in the training set, we generate a positive example and a set of "near-miss" negative examples. These "near-misses" include distributions (sketched in code at the end of this section):

• having an additional peak at the beginning or end of the distribution (2 negatives)
• missing a peak at the beginning or end of the distribution (2 negatives)
• consisting of a single peak from the true distribution (1 negative)
• having one or two additional noise peaks within the distribution (1 negative for each peak, and 1 negative for each pair of peaks)

We build our naïve Bayes model from a training set of the features generated as described previously. To score the generated isotopic map, we use three different metrics, which we call the absolute, coarse, and mono-isotopic scores. The absolute metric is over entire distributions. The second metric is over peaks, essentially capturing the fraction of peaks that are grouped correctly into an isotopic distribution. The third metric is also over peaks, with additional constraints for finding the correct peak and distribution charge. For each score, we define the four quadrants of a contingency table, or confusion matrix. From these confusion matrices, we can calculate performance points for ROC and PR curves.
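The near-miss generation in the list above (excluding the noise-peak insertion variants, omitted here for brevity) can be sketched as follows; representing a distribution as consecutive peak indices into the peak list is our assumption.

def near_miss_negatives(dist, n_peaks):
    # dist: sorted peak indices (0..n_peaks-1) of a true distribution.
    negs = []
    if dist[0] > 0:
        negs.append([dist[0] - 1] + dist)   # additional peak at the beginning
    if dist[-1] < n_peaks - 1:
        negs.append(dist + [dist[-1] + 1])  # additional peak at the end
    if len(dist) > 1:
        negs.append(dist[1:])               # missing the first peak
        negs.append(dist[:-1])              # missing the last peak
    negs.append([dist[0]])                  # a single peak from the distribution
    return negs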

3.3.1 Absolute Scores

Comparing the classifier against the dynamic programming approach requires scoring the optimal map in relation to the generated test examples. We therefore utilize an absolute score method yielding the following counts for a confusion matrix:

• True Positive - Exact distribution appears in the map.
• True Negative - Generated negative distribution does not appear within the map.
• False Positive - Generated negative distribution appears within the map.
• False Negative - Exact distribution does not appear within the map.

This score is similar to the operon map scoring function described in [17].

3.3.2 Coarse Scores

The absolute score counts an isotopic distribution as wrong even if it misses one peak of the distribution or includes one extraneous peak (for example, a noise peak). Nevertheless, a mostly-correct distribution is often very useful as a feature for machine learning or in approaches to quantitative mass spectrometry. The coarse score provides a way of giving credit for such mostly-correct distributions:

• True Positive - Peak is actually in an isotopic distribution and predicted in an isotopic distribution.
• True Negative - Peak is actually not in an isotopic distribution and predicted as not in an isotopic distribution.
• False Positive - Peak is actually not in an isotopic distribution and predicted in an isotopic distribution.
• False Negative - Peak is actually in an isotopic distribution, but predicted as not in an isotopic distribution.

3.3.3 Mono-Isotopic Scores

For some applications, only the mono-isotopic mass is needed. Algorithms related to ours, such as THRASH [36] and AID-MS [9], return the mono-isotopic peak for every predicted isotopic distribution. To compare our annotation method against these other algorithms, we introduce the Mono-Isotopic scores:

• True Positive - Peak is the mono-isotopic peak, and is predicted in a distribution having the same charge.
• False Positive - Peak is not the mono-isotopic peak, but is predicted as one in a distribution, or the charge is not the same.
• False Negative - Peak is a mono-isotopic peak, but not found in an isotopic distribution.
• True Negative - Peak is not a mono-isotopic peak and is not found as a mono-isotopic peak in a distribution.

Our data set is a mixture of normal peptides and their 15N isotopically labeled counterparts. For natural abundance peptides, the mono-isotopic peak corresponds to the peak resulting from only the most common isotopes of each atom: 1H, 12C, 16O, 14N, and 32S. Since each of these is the lowest-mass isotope, the mono-isotopic peak is the peak in each natural abundance envelope with the lowest m/z value. The situation is more complicated for 15N-labeled samples: 15N-labeled peptides do not contain a truly mono-isotopic peak, because every peak in the distribution can result from different combinations of labeled isotopes. However, when nearly complete 15N enrichment is achieved, the isotopic envelope of the labeled peptides takes on a shape similar to its natural abundance envelope, but shifted by one mass unit for each nitrogen in the peptide. By analogy with the unlabeled isotopic envelope, we define the heavy mono-isotopic peak to be the peak within the labeled distribution which results predominantly from 1H, 12C, 16O, 15N, and 32S.

The charge state is calculated using the previously described algorithm. We can then use the mono-isotopic mass-to-charge and the charge values to calculate the mono-isotopic mass.

We run the annotation algorithm in two separate modes. The first uses the original algorithm without regard to noise peaks; we run it on the expert peak-picked dataset, where the noise peaks have been removed, and call it the EP method. The second method, called MP, utilizes the noise-peak-handling algorithm and is run on the machine peak-picked data, where the noise peaks are not removed. We train the machine peak-picked (MP) algorithm using a grid of values for the noise threshold (0.0–0.9, step 0.025) and the noise peak penalty (0.0–0.2, step 0.005). Using this grid, we maximize the function 0.5 × (P + R) for the mono-isotopic scores of the training sections (see the sketch below). We perform ten-fold cross-validation, generating ROC and PR curves for the classifier and ROC and PR points for each of the score metrics. A curve can be generated for the probabilistic classifier, but the isotopic distribution map commits to a particular set of distributions, not a ranking or probability estimate over distributions. Therefore, scoring the isotopic distribution map algorithm's predictions using the absolute metric yields a single point on the graph that signifies the overall ROC and PR scores for the isotope distribution map. These curves and points are generated using the MP algorithm. We also build a curve and point for the expert peak-picked (EP) algorithm, which considers only the peaks marked as being in an isotopic distribution.
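The parameter grid search is a simple double loop; in this sketch, evaluate is an assumed callback returning mono-isotopic precision and recall for one parameter setting.

import numpy as np

def tune_noise_params(train_sections, evaluate):
    best_nt, best_pen, best_score = 0.0, 0.0, -1.0
    for nt in np.arange(0.0, 0.9 + 1e-9, 0.025):        # noise threshold grid
        for pen in np.arange(0.0, 0.2 + 1e-9, 0.005):   # noise peak penalty grid
            p, r = evaluate(train_sections,
                            noise_threshold=nt, noise_penalty=pen)
            if 0.5 * (p + r) > best_score:              # objective: 0.5 × (P + R)
                best_nt, best_pen, best_score = nt, pen, 0.5 * (p + r)
    return best_nt, best_pen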

3.4 Discussion

Our resulting ROC and PR curves for the model and map using leave-one-out cross-validation are shown in Figures 3.4–3.11, and the statistical results are in Table 3.1. For comparison against the classifier's ROC and PR curves, the annotation algorithm point is taken from the absolute scores. The absolute score performance measurement from the EP map gives a true positive rate of 96% and a false positive rate of 0.0%, while the MP map gives a true positive rate of 22.0% and a false positive rate of 1.0%. The coarse score from the EP map gives a true positive rate of 100% and a false positive rate of 0.0%. When tuning on the mono-isotopic score, the MP map obtains a coarse-score true positive rate of 85% and a false positive rate of 4%; the corresponding mono-isotopic rates for the MP map are a true positive rate of 82% and a false positive rate of 1%. The coarse and mono-isotopic scores for the MP map show that the algorithm can annotate peaks that belong to an isotopic distribution and obtain enough information to determine the mono-isotopic masses reasonably well.

Looking at the ROC plots that use the full set of features (Figures 3.4 and 3.6), the curves lie substantially above the diagonal. This indicates that the probabilistic (naïve Bayes) classifier is performing much better than chance. The point corresponding to the isotopic distribution map in turn lies well above the ROC curve, indicating that the construction of the map further improves performance. In domains that have many potential false positives, such as this one, precision-recall (PR) curves are often used instead of ROC curves because they more clearly show differences between algorithms, especially in terms of the number of false positive predictions.

Figure 3.4: Curves and points using all features with expert peak-picked data (ROC curve).

Figure 3.5: Curves and points using all features with expert peak-picked data (PR curve).

Figure 3.6: Curves and points using all features with machine peak-picked data (ROC curve).

Figure 3.7: Curves and points using all features with machine peak-picked data (PR curve).

Figure 3.8: Curves without one feature (ROC curve).

Figure 3.9: Curves without one feature (PR curve).

Figure 3.10: Curves using one feature (ROC curve).

Figure 3.11: Curves using one feature (PR curve).

Metric                      EP Algo    MP Algo
Classifier APR                0.93       0.39
Classifier AROC               0.97       0.96
Absolute
  Precision                   1.00       0.55
  Recall                      0.96       0.22
  F-Score                     0.98       0.32
  False Positive Rate         0.00       0.01
Coarse
  Precision                   1.00       0.76
  Recall                      1.00       0.85
  F-Score                     1.00       0.80
  False Positive Rate         0.00       0.04
Mono-Isotopic
  Precision                   0.98       0.62
  Recall                      0.96       0.82
  F-Score                     0.97       0.71
  False Positive Rate         0.00       0.01

Table 3.1: Statistical Results of Classifier and Dynamic Programming Algorithms (see Section 3.3 for the distinction between absolute and coarse scores)

Metric                  Our Method   AID-MS   THRASH
Precision                  0.62        0.21     0.39
Recall                     0.82        0.66     0.88
F-Score                    0.71        0.32     0.54
False Positive Rate        0.01        0.07     0.04

Table 3.2: Comparison of Mono-Isotopic Peak Finding

The PR curves (Figures 3.5 and 3.7) show that building the map yields a significant advantage in detecting isotopic distributions. Since our dynamic programming approach uses the classification algorithm to build the map, any improvement to the probabilistic classification scheme should in turn improve the map-building performance. The lesion and one-feature curves in Figures 3.8–3.11 give a hint of the relative value of each feature for determining an isotopic distribution. The lesion tests show that omitting any one of the Inter/Intra Error, Min Peaks Stats, or Shape features reduces the classifier's performance. The one-feature tests show that the Inter/Intra Error and Min Peaks features classify well when used individually, though not as well as the full classifier. From this we conclude that, given the current list of available features, the Inter/Intra and Min Peaks features are probably the most important for the classifier.

3.5 Related Work

Two systems that supply similar annotations are THRASH [36] and AID-MS [9]. THRASH is a widely used algorithm for interpreting high-resolution mass spectra; it focuses on fitting theoretical "averagine" isotopic clusters by least squares in a moving window. The AID-MS algorithm implements a top-down (by decreasing spectral intensity) peak selection approach supplemented with novel charge state determination and other features to reduce false positive rates. THRASH and AID-MS return the mono-isotopic peak for every predicted isotopic distribution, rather than returning the full isotopic distribution. Therefore, to compare our method against these two methods, we compare the performance of finding distributions that contain the mono-isotopic mass peak.

Key distinctions between our algorithm and both AID-MS and THRASH are the following. First, construction of a unique map has a tendency to increase precision, but at the cost of decreased recall. Second, our algorithm can be trained. This property has the potential to make it more robust for use on data from a wider array of experimental conditions, with varying machines and with isotopic distributions that occur as the result of isotopic labeling or other "un-natural" conditions, provided annotated training data for the conditions are available.

For both the THRASH and AID-MS algorithms, the results are given in Table 3.2; the output of each method includes the mono-isotopic mass-to-charge, charge state, and mono-isotopic mass for each distribution found in the spectra. For both methods, we use the smoothed spectra from which the peak list sections come. To calculate statistics for THRASH and AID-MS, we compile a list of mono-isotopic masses from our annotated spectra sections and the corresponding results from THRASH or AID-MS. The one challenge in computing these metrics is that THRASH and AID-MS may place a mono-isotopic peak at a slightly different mass position than the expert-annotated data. To adjust for these offsets between the expert list and the THRASH (or AID-MS) list, we employ the following algorithm (sketched in code below): find the closest match between the two lists (distance is the difference in mass); if the difference is less than delta, accept it as a match and remove the matching masses from the two lists; repeat until no more matches are found. To be as fair as possible to THRASH and AID-MS, we vary delta from 0.0 to 0.5 to obtain the best F-score. After this algorithm completes, the following statistics are collected:

• True Positive - number of matches found
• False Positive - number of THRASH/AID-MS peaks left
• True Negative - number of peaks in the peak list minus (TP + FN + FP)
• False Negative - number of annotated mass peaks left

The results of these metrics applied to our annotated spectra set are given in Table 3.2. Our algorithm achieves a significantly better F1-Score than THRASH and AID-MS on this data set. In the future, we plan to analyze data obtained from different mass spectrometers and compare these algorithms' results. It is also worth noting that THRASH and AID-MS were originally developed to work with raw spectra, whereas we only used smoothed data; using raw data may change the results for THRASH and AID-MS.
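The matching loop just described might be sketched as follows; the list handling details are our assumptions.

def greedy_mass_match(expert, predicted, delta):
    # Greedy offset-tolerant matching between the expert and THRASH/AID-MS
    # mono-isotopic mass lists (inputs are copied, not mutated).
    a, b = list(expert), list(predicted)
    matches = []
    while a and b:
        # Globally closest pair by absolute mass difference.
        i, j = min(((i, j) for i in range(len(a)) for j in range(len(b))),
                   key=lambda ij: abs(a[ij[0]] - b[ij[1]]))
        if abs(a[i] - b[j]) >= delta:
            break                       # remaining pairs are all too far apart
        matches.append((a.pop(i), b.pop(j)))
    return matches, a, b                # matches, leftover expert, leftover predicted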

3.6 Improvements

Since this work was published [53], we have identified modifications that increase performance and speed up the annotation algorithm. The exhaustive search for the best subset of peaks is changed to a more directed search over valid isotopic distributions. The training example generation is adapted to reflect this idea of valid isotopic distributions, giving a better representation of the underlying concept. Regression is used in place of classification in order to allow scoring of "mostly correct" isotopic distributions. Training the dynamic programming noise-penalty and noise-threshold parameters with a weighted F1-score is also attempted. We introduce a different scoring metric, a modification of the Mono-Isotopic scores that assigns more credit for annotating peaks in the distribution together with their mono-isotopic peak. We explain, compare, and contrast these changes to the annotation algorithm on the original dataset. We also analyze these methods on a new dataset that has a peak list from the full spectrum rather than sections of data.

3.6.1 Valid Isotopes

Looking at the structure of an isotopic distribution, the peaks are evenly spaced in mass-to-charge value by 1/C ± ε, where C is the charge state and ε is an error term. Using this observation as a constraint, certain subsets of peaks are never considered as potential isotopic distributions. This idea is used in two different parts of the algorithm: the search between i and i − j for the best subset of peaks ending at i, and the training example generation. For the remainder of this dissertation, we set ε = 150 ppm. This value should be large enough to include all real isotopic distributions in the training set while allowing a significant speed-up in the generation of the S(i, j)'s. Ideally, this value could be trained from the data; this is left for future work.

3.6.1.1 Search Algorithm

Since the overall goal is to create an algorithm that is both fast and accurate, we consider modifications that speed up the annotation process. One is to change the exhaustive search over subsets of peaks ending at i and starting at i − j into a heuristic search. Using the concept of "Valid Isotopes", the heuristic search finds all subsets of the peaks that are evenly spaced by 1, 1/2, ..., 1/maxCharges ± ε. Starting with the ending peak i, the search scans for all peaks that are 1/C ± ε less than the mass-to-charge value of peak i. From this list of peaks, all peaks that are 1/C ± ε below them are found in turn. This is repeated recursively, finding all valid subsets between i and i − j; there is also a limit on the maximum number of peaks that can be in a distribution. These "valid" isotopes are found through a more directed search that speeds up the generation of the S(i, j)'s and offers a potential increase in precision, since "invalid" isotopic distributions are never considered. This breaks the assumption of optimality but, as seen in the results, does not significantly degrade accuracy.
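A sketch of this recursive enumeration follows; the signature is illustrative, and applying the tolerance as a relative (ppm) error is our reading of ε = 150 ppm.

def valid_isotopes_ending_at(i, mz, max_charge=3, eps_ppm=150.0, max_peaks=12):
    # Returns peak-index subsets ending at peak i whose consecutive spacings
    # are 1/C within the relative tolerance, for each charge C. Single-peak
    # subsets repeat once per charge; a real implementation would deduplicate.
    results = []
    for c in range(1, max_charge + 1):
        def extend(subset):
            results.append(subset)
            if len(subset) >= max_peaks:
                return
            target = mz[subset[0]] - 1.0 / c  # expected next lighter peak
            for k in range(subset[0] - 1, -1, -1):
                if abs(mz[k] - target) <= eps_ppm * 1e-6 * mz[k]:
                    extend([k] + subset)      # recurse leftward from peak k
        extend([i])
    return results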

3.6.1.2 Training Examples

Building upon the "valid" isotopic distribution idea, we change the examples generated for training the isotopic distribution scoring algorithm. Instead of using the "near-miss" examples, we now build a training set that uses the "valid" isotopes generated as above. This provides a richer set of examples and allows the scorer to learn from examples that are based upon the directed search algorithm explained previously.

3.6.1.3 Mono-Isotopic Fine Scores

As mentioned before, we would like to soften the scoring of an isotopic distribution map. The minimum requirement was to annotate distributions that contain the mono-isotopic peak; however, more credit should be given if the annotation algorithm assigns more peaks to the correct isotopic distribution. This new metric, called the Mono-Isotopic Fine score, is a peak-based metric defined as follows:

• True Positive - Peak is actually in an isotopic distribution and predicted as in the distribution with the correct mono-isotopic peak.
• True Negative - Peak is actually not in an isotopic distribution and not predicted as in an isotopic distribution.
• False Positive - Peak is actually not in an isotopic distribution but predicted as in an isotopic distribution.
• False Negative - Peak is actually in an isotopic distribution but not predicted in an isotopic distribution, or the peak is in a predicted distribution that contains the incorrect mono-isotopic peak.

3.6.2 Regression

In the previous work, we generated the negative "near-miss" isotopes by perturbing the isotopic distribution and marking them as entirely incorrect [53]. However, these perturbed isotopic distributions can instead be scored as "mostly correct." To capture this information, we now use a continuous output label in (0–1), where 0 is a true negative, 1 is a true positive, and the values in between measure how close the isotopic distribution is to the expert annotation. This score between the actual and predicted isotopic distributions is calculated using the following equation:

    S_I(P, T) = 0                                if P is missing the mono-isotopic peak,
    S_I(P, T) = 2 N_m(P, T) / (N_P + N_T)        otherwise.                          (3.3)

where P is the predicted isotopic distribution, T is the true isotopic distribution, N_m is the number of matching peaks between T and P, and N_P and N_T are the number of peaks in P and T respectively.

We predict this score from the features with a regression function built from the training examples and their "expert" S_I value. The S(i, j) are now the log (base 2) of the predicted S_I score, so the dynamic programming algorithm now attempts to maximize the predicted S_I's in the map. For this study, we use a regression tree model called M5Prime [90].
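Equation 3.3 is the Sørensen-Dice overlap of the predicted and true peak sets, zeroed when the mono-isotopic peak is missed; a small illustrative helper:

def si_score(pred_peaks, true_peaks, mono_peak):
    # pred_peaks and true_peaks are sets of peak indices; mono_peak is the
    # index of the true mono-isotopic peak (Eq. 3.3).
    if mono_peak not in pred_peaks:
        return 0.0
    n_match = len(pred_peaks & true_peaks)
    return 2.0 * n_match / (len(pred_peaks) + len(true_peaks))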

3.6.3 Weighted F1-Score

A spectrometrist may desire to detect more isotopic distributions at the expense of potentially annotating more false positives. To achieve this, the algorithm's parameters are trained using a weighted Fβ-score of the form [43]:

    F_β = ((β² + 1) · P · R) / (β² · P + R)        (3.4)

where β = 8 for the weighted and β = 1 for the unweighted results. This value can be adjusted depending upon the user's recall versus precision requirements.
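As a one-line check of Equation 3.4 (hypothetical helper name):

def f_beta(precision, recall, beta=1.0):
    # Weighted F-score of Eq. 3.4; beta > 1 (e.g. beta = 8) favors recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)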

3.6.4 Experiments

The pseudo-code presented in Algorithm 2 illustrates the modifications for the training phase of the annotation algorithm. Initially, we use every improvement described in Sections 3.6.1–3.6.3 for comparison against the previous work, using the same 10 cross-validation folds. To further discern which modifications improve accuracy, 10-fold cross-validation is performed using the section data with every possible combination of the improvements discussed above. This includes using valid isotopic distribution or near-miss examples, exhaustive or valid isotopic distribution subset searching, using a classifier or regression for isotope scoring, training the parameters with the Mono-Isotopic or Mono-Isotopic Fine scores, and the weighted vs. unweighted Fβ-score. There are 32 possible settings, which are summarized in Table A.1. For these runs, we also train the search length from 5–12 and use 0.00–1.00, step 0.025, for both the noise penalty and noise threshold parameters.

Data: Training peak lists P_k
Result: SearchLength, n_p, n_t, IsotopeScorer

Generate training examples;            // Near-miss or Valid Isotopes
Train IsotopeScorer;                   // Classifier (naïve Bayes) or Regressor (M5Prime)
foreach PeakList P_k do
    Generate S_k(i, j);                // Exhaustive or Valid Isotopes
end
for SearchLength = minSearchLength to maxSearchLength do
    Find n_t and n_p that maximize F_β from annotating P_k (using S_k(i, j));
                                       // β = 1 or 8; Mono-Isotopic or Mono-Isotopic Fine
end
Return SearchLength, n_t, and n_p that maximize F_β;

Algorithm 2: Pseudo-code for training the annotation algorithm

3.6.5 Full Peak List

In the published work, only hand-picked sections of peak lists were used, and many of these sections come from the same spectrum. Ideally, full peak lists should be annotated for training and testing instead. In this spirit, a new dataset was obtained spanning a mass-to-charge range of 800–4000. This dataset consists of 41 peak lists containing 713 expert-annotated isotopic distributions. They also contain isotopic matched-pair annotations, which are explained in Chapter 5. We calculate statistical scores using all of the improvement options described. To compare general trends across the section and full peak list datasets, we also perform the same 32 simulations of every possible setting and calculate the statistics of the Mono-Isotopic and Mono-Isotopic Fine metrics.

3.6.6 Results

Table 3.4 summarizes the Mono-Isotopic and Mono-Isotopic Fine scores using all of the proposed improvement options for both the section and full peak list datasets. For completeness, the Mono-Isotopic and Mono-Isotopic Fine results for all 32 possible settings are presented in Figures A.1 and A.2, and each individual improvement's effect is plotted in Figures A.3–A.16. Similarly, the overall Mono-Isotopic and Mono-Isotopic Fine scores for the full peak list are given in Figures A.17 and A.18 respectively, and the individual modifications are in Figures A.19–A.32. All of these figures are in Appendix A. To compare the time improvement of the exhaustive versus "Valid Isotopes" search, the total time taken to generate the S(i, j)'s for all the section and full peak list data is calculated. These results are summarized in Table 3.3.

Data Set           Exhaustive Search   Valid Isotopes Search
Total Time (msec)
  Original               252016                9516
  Full Peak List         540453                5765
Average Total Time (msec / number of peak lists)
  Original                 2550                96.1
  Full Peak List          13182               141.0
Average Point Time (msec / (number of peaks per list × number of peak lists))
  Original                 30.8                1.16
  Full Peak List           70.6                0.75

Table 3.3: Time taken generating the S(i, j)'s. Results are from an AMD Turion 64 × 2 Mobile Technology TL-50 (1.6 GHz) computer with 2 GB RAM.

3.6.7 Discussion

The results shown in Table 3.4 and displayed in Figures 3.12–3.14 show significant improvement over the original work. However, from the figures plotted for each possible option in Appendix A, more study is required to discern which options most benefit the annotation algorithm. Looking at the results for each possible combination of the improvements, the following observations are made. 1) There is no statistically significant difference in F1-Score performance between the exhaustive search and the "Valid Isotopes" search, but there is a significant decrease in S(i, j) generation time (Table 3.3). 2) There is an increase in F1-Score performance for most settings when the "Valid Isotopes" examples are used in place of the "Near-Miss" examples; it is interesting to note that this even improved most of the classifier results.

Training Metric            F1-Score
Section Data
  Mono-Isotopic            0.88 ± 0.05
  Mono-Isotopic Fine       0.77 ± 0.02
Full Peak List Data
  Mono-Isotopic            0.71 ± 0.06
  Mono-Isotopic Fine       0.76 ± 0.05

Table 3.4: Summary of improved method F1-Scores. The improved method uses the "Valid Isotopes" search and examples, M5Prime regression, and parameters trained using a weighted F1-Score on Mono-Isotopic Fine scores.


Figure 3.12: Mono-Isotopic F1 Scores for the improved method versus the original published work.

3) Training the parameters using either the Mono-Isotopic or the Mono-Isotopic Fine scores shows no significant difference. 4) The regression method with M5Prime does not perform well using the "Near-Miss" examples; furthermore, using the "Valid Isotopes" examples with M5Prime is not significantly better than the naïve Bayes classifier. 5) Training the algorithm's parameters using the weighted recall significantly increases recall in most instances, but at a cost in precision, which degrades the F1-Score.

In conclusion, the "Valid Isotopes" search with the "Valid Isotopes" examples for training gives improved performance in both the F1-Score and the time taken to generate the S(i, j)'s. The tradeoff between recall and precision using the weighted recall will become clearer in Chapters 5 and 6, where annotating the isotopic pairs for quantification may need better recall at the expense of precision. The full peak list gives similar results.



Figure 3.13: Mono–Isotopic Precision Scores for the improved method versus the original published work.


Figure 3.14: Mono–Isotopic Recall Scores for the improved method versus the original published work.

Due to the differences in mass of the amino acids, there are combinatorially many possible peptides near any particular mass, and overlapping distributions from different peptides are inevitable. A modified algorithm to handle overlapping distributions is given in Chapter 4, where the isotopic annotations are used to improve the qualitative aspect of mass spectrometry for the prediction of Prion disease.

3.7 Future Work

As seen in Figures 3.4 and 3.5, the algorithm performs well in the absence of noise peaks; performance drops when noise peaks are present (Figures 3.6 and 3.7). Trying other types of classifiers might improve performance; possible classifiers include support vector machines, Bayesian network learning algorithms, tree-augmented naïve Bayes, and probabilistic decision trees. Similarly, trying other regression functions may benefit the algorithm, and a different target function for the regression to learn may give more insight for improving the annotation method. Enhancing the features (such as distribution shape) or introducing more features into the model could also increase the performance of the classifier and subsequently the map-building algorithm. Additional features that capture some of the logic from AID-MS and THRASH would be beneficial. The Patterson or Fourier methods for determining charge state could be implemented as a set of features. Estimating isotopic distributions using "averagine" compounds around the mass of interest may in fact provide a better shape feature than the one presented here; isotopic distribution peaks generally increase in number as the mass-to-charge ratio increases. These features could produce an improved score function, which would in turn allow the dynamic programming algorithm to produce better isotopic maps.

With the dramatic speed increase in S(i, j) matrix generation using the "Valid Isotopes" heuristic, longer search lengths are now attainable. This may be beneficial if the isotopic distributions become longer or the number of noise peaks within an isotopic distribution increases. Training a tighter estimate of the search cutoff parameter ε may further reduce the S(i, j) matrix size and processing time while increasing overall performance.

Other directions would be to augment and test the algorithm on other annotated MS datasets, including raw, more complex, or less resolved data. Extending the algorithm to integrate LC-MS data is another direction; to handle this we would add features to the classifier or regressor that take into account data from neighboring mass spectra scans. Since this algorithm annotates the distributions well, we can build feature vectors from its output. In those cases where the organism's proteome is known, we propose to further map the isotopic distribution annotations to peptide annotations. This will require MS-MS data or peptide fingerprinting, using a variety of widely available tools for searching peptide databases (e.g. SEQUEST [23], MS-Fit [13], MASCOT [59]) and evaluating the significance of their assignments (e.g. PeptideProphet [45]).

3.8 Conclusions

This chapter presented both a naïve Bayes and a regression model for assigning scores to potential isotopic distributions. Using these models with dynamic programming to map a spectrum into its isotopic distributions shows promise. The annotation algorithm also performs further removal of noise peaks, those not removed during initial preprocessing, while constructing the isotopic peak distributions. The algorithm is further improved in its F1-Score at finding the distributions that contain the mono-isotopic peak, as well as the peaks that belong to the proper isotopic distribution as measured by the Mono-Isotopic Fine score. Using a more directed search based upon the underlying chemistry and the mass-to-charge spacing of isotopic distributions does not improve the F1-Score significantly, but greatly reduces the time taken for S-matrix generation. Using "Valid Isotopes" examples improves the F1-Score. Using a regressor to estimate a score rather than a classifier does not show a significant difference, but more classifier and regressor algorithms can be tried. In conclusion, the annotation algorithm presented here gives promising results; it is adapted for use in qualitative and quantitative applications throughout the rest of this dissertation.

3.9 Acknowledgments

We thank Neil Kelleher, Craig Wenger, and Richard LeDuc from the University of Illinois at Urbana-Champaign for running THRASH on our data. This work is supported in part by NLM training grant 5T15LM00739, NSF grant 0534908, and NIH training grant 5-T32-GM08349. We would also like to acknowledge Dr. Amy Harms and the UW-Madison Biotechnology Center Mass Spectrometry Facility for technical advice and assistance.


Chapter 4

Using Isotopic Distribution Maps to Generate Features for Disease Classification

4.1 Introduction

Analyzing proteomic data generated from mass spectrometry shows promise for predicting disease and finding biomarkers within bodily fluid samples (serum, urine, etc.). By taking mass spectra of affected and unaffected samples, potential features can be extracted that separate the two groups using classification algorithms. One major challenge is generating predictive features that have statistical and biological relevance [2, 15]. Many different approaches have been explored [19, 35, 52, 63, 64, 70, 77, 79, 91, 96]. Despite the strengths of several of these approaches, they do not make use of isotopic distribution information [3, 27, 87].

There are many issues to consider when analyzing proteomic mass spectrometry data for classification. Ion saturation and sample inhomogeneities reduce the predictive power of the peak intensities of a sample's mass spectrum. There are mass-to-charge (m/z) shifts of common peaks between spectra that need to be accounted for. Peaks arising from the matrix ions need to be handled as well. Within the spectrum, there are peaks caused by chemical and electronic noise during sample acquisition that are not indicative of the underlying biology. There are also redundant features due to isotopic distributions, sodium adducts, multiple proton adsorptions, and peptide fragments occurring from proteolysis.

This chapter presents a method that takes output from the isotopic distribution annotation algorithm described in Chapter 3 to generate features that more closely match the underlying isotopic structure. This is a potential advantage over classification using just the peak lists. The case study is mass spectra obtained from the cerebral spinal fluid (CSF) of hamsters infected with a transmissible spongiform encephalopathy, a.k.a. Prion disease. Another point of the approach presented here is the use of multiple mass spectra taken from one sample. This helps reduce the classification prediction error when the number of training examples is small (N = 43 in this case) compared to the number of features. While using subsamples cannot completely substitute for independent samples, we observe an increase in prediction accuracy.

4.1.1 Prion Disease

The transmissible spongiform encephalopathies (TSEs) are an unusual class of neurological diseases that include scrapie in sheep and goats, bovine spongiform encephalopathy (BSE) in cattle, transmissible mink encephalopathy (TME), and chronic wasting disease (CWD) in elk and deer. Human forms of the disease include Creutzfeldt-Jakob disease (CJD), Gerstmann-Sträussler-Scheinker syndrome, fatal familial insomnia, and kuru. TSEs are characterized by spongiform degeneration, reactive astrocytosis, and prion protein (PrP) accumulation in the central nervous system. An extended incubation period of months to decades is characteristic of this group of fatal diseases [39]. Clinical features of TSEs vary with host species but, generally, clinically affected animals display pruritus, incoordination, and ataxia. TSE-infected animals do not display clinical symptoms throughout the majority of the infection's duration.

Currently validated diagnostic tests for prion diseases are all post-mortem, and all involve detection of abnormal accumulation of the prion protein in the central nervous system. The "gold standard" in prion diagnostic testing is immunohistochemistry utilizing anti-prion protein antibodies on the obex region of the brain. Drawbacks to utilizing immunohistochemistry on the obex include the low throughput of samples and the necessity for post-mortem analysis. High-throughput antibody-based diagnostics, such as the Prionics or Bio-Rad tests, utilize a western blot/ELISA approach. These diagnostics take advantage of the protease resistance of the mis-folded prion protein. Despite the good specificity of these tests, their sensitivity is low, such that animals infected with Prion disease cannot be diagnosed until late in the pre-clinical period, when sufficient abnormal PrP has accumulated in brain tissue. A further limitation of post-mortem diagnostics is their inability to monitor and track disease progression. As an alternative, we investigate the ability to predict Prion disease by analyzing mass spectra obtained from the cerebral spinal fluid of hamsters.

4.2 Approach

To simplify the problem, we analyze peak lists provided by the mass spectrometer's software using the default settings. To classify the mass-to-charge data, there are four main steps: (1) pre-process the mass-to-charge peak data, (2) extract feature vectors from the peak data, (3) pre-process the feature data, and (4) classify the peak data. Using these four steps with cross-validation gives an estimate of the accuracy of the whole process.

4.2.1 Pre-processing the Mass-to-Charge Peak Lists

Some general cleanup of the mass spectrometry data is necessary to reduce biases and artifacts. Peaks that occur due to the matrix material used for the ionization process must be removed. Spectrum-to-spectrum drifts in the mass-to-charge values of peaks arising from the same molecule must also be accounted for. Finally, deisotoping methods remove redundant peaks arising from the isotopic distributions.

The matrix ionization material generally gives rise to many peaks within the 0-500 mass-to-charge range. These peaks are not biologically informative and can confuse the rest of the classification process. To suppress these ions, the process removes any peak with a mass-to-charge value less than 500 from the data.

Peaks that arise from the same molecule in different spectra can differ in their mass-to-charge values. Depending upon the spectrometer and the operating environment, these drifts can be quite significant. In an effort to minimize them, the processing algorithm performs a linear alignment using calibrating peaks. These calibrating peaks exist in most of the spectra and are associated with an isotopic distribution. Using a window of ±500 ppm, the most intense peaks in the spectrum that correspond to the calibrating peaks are found. The process then calculates the

correction to the mass-to-charge values using Equation 4.1.

    Corrected(m/z) = A × Current(m/z) + B                              (4.1)

[Figure 4.1: Example of Two Interleaving Isotopic Distributions (Intensity versus Mass-to-Charge (m/z)); plot not reproduced.]

Dataset                              Peak Lists   Isotopic      Interleaving   Percent
                                                  Annotations   Annotations    Interleave
Sections (See Chapter 3)                 99           214             0          0.00%
Full Peak List (See Section 3.6.5)       41           713             6          0.84%
Prions (This Chapter)                    10          1205           284         24.00%

Table 4.1: Interleaving Isotopic Distribution Statistics

The factors A and B are calculated using a least-squares fit of the calibrating versus observed mass-to-charge values. We repeat this process on each peak list. For this dataset the calibrating mass-to-charge values are 673, 1274.6, 2060, 2126, and 2326.

The algorithm then employs deisotoping to remove redundant peaks that arise from the same molecule because of its isotopic distribution. For comparison we perform (1) no deisotoping, (2) isotopic distribution annotation deisotoping, and (3) isotopic distribution annotation with interleave deisotoping. For the last two methods, we keep only the mono-isotopic peaks of the annotated distributions, which we define as the peaks with the smallest mass-to-charge value within each annotated isotopic distribution. Using this peak's mass-to-charge value and the estimated charge state, the mass is calculated; this mass value, along with the intensity, is used in place of the annotated isotopic distribution peaks.
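A minimal sketch of these pre-processing steps follows, assuming each peak list is a NumPy array of (m/z, intensity) rows. The function names, the proton-mass constant, and the mass conversion formula are illustrative assumptions; the text does not spell them out.

import numpy as np

PROTON_MASS = 1.00728  # Da; assumed constant, not given in the text

def preprocess_peak_list(peaks, calibrants, tol_ppm=500.0):
    """Remove matrix peaks and linearly recalibrate one peak list.
    `peaks` is an array of (m/z, intensity) rows; assumes at least two
    calibrating peaks are found."""
    peaks = peaks[peaks[:, 0] >= 500.0]           # drop matrix region < 500 m/z
    observed, expected = [], []
    for c in calibrants:
        window = np.abs(peaks[:, 0] - c) <= c * tol_ppm * 1e-6
        if window.any():                          # most intense peak in window
            row = np.flatnonzero(window)[np.argmax(peaks[window, 1])]
            observed.append(peaks[row, 0])
            expected.append(c)
    A, B = np.polyfit(observed, expected, 1)      # least squares, Equation 4.1
    peaks = peaks.copy()
    peaks[:, 0] = A * peaks[:, 0] + B
    return peaks

def monoisotopic_mass(mz, charge):
    """Neutral mass from the mono-isotopic peak's m/z and charge state
    (the standard conversion; an assumption about the exact formula)."""
    return charge * (mz - PROTON_MASS)

calibrants = [673.0, 1274.6, 2060.0, 2126.0, 2326.0]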

4.2.2 Isotopic Distribution Annotation with Interleaving Peaks

We now describe a new dynamic programming algorithm that accounts for the possibility that the peaks of different isotopic distributions may interleave with each other on the mass-to-charge axis.

An example of this effect is illustrated in Figure 4.1. For the annotated training peak lists provided by the expert for the prion data, 24% of the isotopic distributions interleave with at least one other isotopic distribution (see Table 4.1). The previous isotope algorithm does not take this into account when annotating the isotopic distribution groups.

The modification to the previous dynamic program given in Equation 3.2 is that the group of peaks ending at peak i can now add to the map from the i-j, ..., i-1 positions. M(i, j) is now defined as the optimal map ending at i of length j. To allow interleaving, the noise-peak penalty does not apply if the "noise peak" has already been annotated within the overlapping map; instead, a penalty is introduced for annotating a peak that has already been annotated previously within the map. Algorithm 3 describes the resulting dynamic program. Since the assigned penalty is based upon the previous optimal path, S^m(i, j) now contains all subsets of the peaks starting at i-j and ending at i, indexed by m, rather than only the best-scoring S(i, j). Here γn is the noise penalty, and γo is the overlap penalty, that is, the cost of assigning two isotopic distributions to the same peak. The function Nn counts the number of noise peaks that S^m(i, j) has, excluding the ones that have already been assigned to an isotopic distribution in the path starting at M(k, l); No counts the number of peaks that S^m(i, j) annotates that are already in an assigned isotopic distribution. If S(i, j) is a noise run as defined in Chapter 3, then Equation 4.4 defines M(i, j), where Nt is the noise threshold.

For both isotope annotation methods we train on a set of ten annotated peak lists provided by the expert. For the annotation models we train the Naïve Bayes classifier using the "Valid Isotopes" examples; the "Valid Isotopes" method also generates the S^m(i, j) values. The parameters are trained on a grid (0-1.0 in steps of 0.1 for the noise-peak penalty γn and the overlap penalty γo, and 0-0.9 in steps of 0.1 for the noise threshold Nt), maximizing the F1 of the Mono-Isotopic Fine scores on the training data; a sketch of this search appears below.
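The grid search itself is a straightforward enumeration. In the sketch below, annotate (Algorithm 3's dynamic program) and mono_isotopic_fine_f1 (the Chapter 3 scoring metric) are assumed helpers with hypothetical signatures, not functions defined in this work.

import itertools
import numpy as np

def train_penalties(training_peak_lists, annotate, mono_isotopic_fine_f1):
    """Exhaustive grid over (gamma_n, gamma_o, N_t), keeping the setting
    that maximizes mean Mono-Isotopic Fine F1 on the training lists."""
    best_params, best_f1 = None, -1.0
    for gamma_n, gamma_o in itertools.product(np.arange(0.0, 1.01, 0.1),
                                              repeat=2):
        for n_t in np.arange(0.0, 0.91, 0.1):
            f1 = np.mean([mono_isotopic_fine_f1(annotate(pl, gamma_n,
                                                         gamma_o, n_t))
                          for pl in training_peak_lists])
            if f1 > best_f1:
                best_params, best_f1 = (gamma_n, gamma_o, n_t), f1
    return best_params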

Data: Peak List
Result: Isotopic Distribution Annotations
Generate the candidate subsets S^m(i, j);
Generate M(i, j) using the following equations:

    M(0, 0) = 0                                                        (4.2)

    if S^m(i, j) is not a noise run then

        M(i, j) = max over m, k, l of { Score(S^m(i, j)) + M(k, l)
                      + γn Nn(S^m(i, j), M(k, l))
                      + γo No(S^m(i, j), M(k, l)) }                    (4.3)

        with k = 0, ..., i - 1, and l = 0 when k = 0, l = 1, ..., k when k > 0

    else

        M(i, j) = Nt + 0.1 + max over l of M(i - j, l)                 (4.4)

    end

Recover the path taken to determine the isotopic distribution annotations;

Algorithm 3: Algorithm for annotating isotopic distributions with interleaving isotopes
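To make the recurrence concrete, here is a minimal Python sketch of Algorithm 3. Its interface is an assumption for illustration: candidates maps (i, j) to the scored peak subsets S^m(i, j) (scores coming from the Naïve Bayes model), is_noise flags noise runs as defined in Chapter 3, peaks are integer indices held in frozensets, and the final pass that recovers the annotation path from the stored back-pointers is omitted.

def interleave_dp(n_peaks, candidates, is_noise, gamma_n, gamma_o, n_t):
    """Simplified form of Algorithm 3. candidates[(i, j)] is a list of
    (frozenset_of_peak_indices, score) pairs for peaks i-j..i; returns
    the table M of (score, back-pointer, peaks used on the best path)."""
    M = {(0, 0): (0.0, None, frozenset())}                 # Equation 4.2
    for i in range(1, n_peaks + 1):
        for j in range(1, i + 1):
            best = None
            if is_noise(i, j):
                # Equation 4.4: a noise run pays the noise threshold + 0.1
                for (k, l), (s, _, used) in list(M.items()):
                    if k == i - j and (best is None
                                       or s + n_t + 0.1 > best[0]):
                        best = (s + n_t + 0.1, (k, l), used)
            else:
                # Equation 4.3: extend some M(k, l), k < i, with a subset
                # S^m(i, j); Nn skips noise peaks already annotated in the
                # path, while No charges gamma_o for re-annotated peaks
                # (the gamma terms act as penalties in the trained model)
                for subset, score in candidates.get((i, j), []):
                    for (k, l), (s, _, used) in list(M.items()):
                        if k >= i:
                            continue
                        nn = sum(1 for p in range(i - j, i + 1)
                                 if p not in subset and p not in used)
                        no = len(subset & used)
                        total = score + s + gamma_n * nn + gamma_o * no
                        if best is None or total > best[0]:
                            best = (total, (k, l), used | subset)
            if best is not None:
                M[(i, j)] = best
    return M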

4.2.3 Extracting Feature Vectors

From these peak lists we now want to generate a feature data set whose features are a good representation of the underlying data. First, we need features that account for the mass-to-charge drifts of peaks coming from the same molecule. To do this, we employ a version of Tibshirani's clustering algorithm that finds midpoints for peaks with similar mass-to-charge values. Using this algorithm, midpoints are generated with a cutoff of ±50 ppm from

their max-min midpoint [80]. We use discrete indicators for the existence of a peak within each bin as the feature values.
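The following is a greedy one-dimensional sketch of this binning step, not Tibshirani's algorithm itself: peaks pooled from all spectra are sorted, and a bin is closed whenever extending it would push a member more than 50 ppm from the running max-min midpoint. All names are illustrative.

import numpy as np

def make_bins(pooled_mz, tol_ppm=50.0):
    """Greedy stand-in for the clustering step over pooled m/z values."""
    mz = np.sort(np.asarray(pooled_mz, dtype=float))
    bins, start = [], 0
    for i in range(1, len(mz) + 1):
        close_bin = i == len(mz)
        if not close_bin:
            mid = (mz[start] + mz[i]) / 2.0
            # close when half the span exceeds 50 ppm of the midpoint
            close_bin = (mz[i] - mz[start]) / 2.0 > mid * tol_ppm * 1e-6
        if close_bin:
            bins.append((mz[start] + mz[i - 1]) / 2.0)   # max-min midpoint
            start = i
    return bins

def indicator_features(spectrum_mz, bins, tol_ppm=50.0):
    # discrete indicators: 1 if the spectrum has any peak inside the bin
    return [int(any(abs(p - b) <= b * tol_ppm * 1e-6 for p in spectrum_mz))
            for b in bins]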

4.2.4 Pre-processing the Feature Vector Data

While classification using all of the generated features gives relatively good performance, reducing the number of features gives spectrometrists a more manageable list of features/peaks for the purpose of peptide identification. To reduce the feature count, we keep the top 10 and top 100 features as ranked by information gain. The cross-validation results for these two feature-selection options are compared and contrasted against the full feature set results.
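A minimal sketch of the information-gain ranking for binary indicator features follows; the function names are illustrative.

import numpy as np

def entropy(y):
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def information_gain(x, y):
    """Entropy of the labels minus conditional entropy given feature x."""
    gain = entropy(y)
    for value in (0, 1):
        mask = x == value
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def select_top_k(X, y, k):
    gains = np.array([information_gain(X[:, j], y)
                      for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:k]   # column indices of the k best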

4.2.5 Classification of Data

Classification is performed using a linear SVM [60]. Linear SVMs have been shown to perform well on data with many features and few examples [83]. Other classifiers, such as Bayesian networks and decision trees, could be substituted for this step, as discussed in the future work section.
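The dissertation does not name an SVM implementation; as a hedged illustration, scikit-learn's LinearSVC stands in below, fit to random placeholder data with the dataset's rough dimensions (387 spectra, roughly 4,970 indicator features after interleave deisotoping).

import numpy as np
from sklearn.svm import LinearSVC

# random placeholder data, not the Prion dataset
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(387, 4970)).astype(float)   # indicator features
y = rng.integers(0, 2, size=387)            # 1 = infected, 0 = uninfected

clf = LinearSVC()                           # linear kernel; C at its default
clf.fit(X, y)
print(clf.score(X, y))                      # resubstitution accuracy only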

4.3 Methodology

4.3.1 Data Acquisition

For this study, forty-three hamsters were orally inoculated with the Prion agent, and the disease was allowed to incubate for 18 weeks. The cerebrospinal fluid (CSF) and serum from the twenty-one infected and twenty-two uninfected hamsters of the same age were collected and the animals sacrificed. The CSF was then digested with trypsin overnight and submitted for mass spectrometric analysis.

The mass spectrometer was an IonSpec Fourier transform mass spectrometer (FTMS) with a 7.0 T actively shielded superconducting magnet and a matrix-assisted laser desorption/ionization (MALDI) front end. The MALDI source utilized a 337 nm nitrogen laser with a quadrupole ion guide to transfer ions into the analyzer cell, which is differentially pumped to 10^-10 torr. Each sample was spotted three times and each spot analyzed three times, for a total of nine analyses per animal, to assess reproducibility and increase statistical confidence. For spotting samples, 0.6 µL of a saturated solution of 2,5-dihydroxybenzoic acid (DHB; 150 mg in 500 µL purge-and-trap grade MeOH (Fisher) and 500 µL ddH2O) was spotted, followed by 0.6 µL of the sample

onto the MALDI sample probe. After crystallizing at room temperature, the spots were analyzed on the FTMS using a method called in-cell accumulation (ICA), whereby multiple ionization/desorption events occur prior to detection, thus increasing the apparent ion concentration in the analyzer cell [51]. For this study, an ICA sequence with three pulses focused at 1000 m/z and four pulses focused at 3000 m/z was used, for a total of seven ionization/desorption events per acquisition. To further increase the signal-to-noise ratio and improve spectral quality, five acquisitions were performed and their signals averaged into one spectrum for each of the nine analyses per animal. Each of the three spectra per spot was obtained from a different point on the spot. Due to the number of samples and mass spectrometry acquisitions, and the amount of time needed for the spotting and analysis of these samples, not all infected and uninfected samples from a given time point were analyzed on the same day. In an attempt to minimize peak artifacts introduced by acquisition time, the same number of infected and uninfected samples was measured each day. After collecting the data, the peak list from each spectrum (generated automatically by the IonSpec software) was exported into a spreadsheet. The number of spectra totaled 387.

4.3.2 Analysis

We estimate the accuracy of the full process of feature generation, feature selection, and classification to ensure reliable prediction accuracy on future data. To measure model accuracy, we performed 10-fold cross-validation. For each hamster left out, all nine of its corresponding spectra are excluded from the training set. We measured accuracy using two methods. In the first, each spectrum is treated as a separate sample. In the second, we take the majority vote of the sample's discrete classification values (1 = infected, 0 = uninfected) across the nine spectra. Note that we repeat the feature-selection options (All, Top 10, and Top 100) on every fold of cross-validation. To further test the model's accuracy, permutation tests for all nine different runs are performed [38]. The class labels for all 43 hamsters are permuted, keeping their multiple measurements intact, and the whole cross-validation process is repeated 1,000 times using the different deisotoping choices discussed in Section 4.2.
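A sketch of this evaluation loop follows, assuming X holds one row per spectrum, y the per-spectrum labels, and animal_ids the hamster each spectrum came from; scikit-learn is assumed, and the per-fold feature selection is omitted for brevity.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC

def cross_validate(X, y, animal_ids, n_folds=10):
    """Grouped 10-fold CV: all nine spectra of a held-out hamster leave
    the training set together; returns per-spectrum and majority-vote
    accuracy."""
    spectrum_hits, vote_hits = [], []
    folds = GroupKFold(n_splits=n_folds).split(X, y, groups=animal_ids)
    for train, test in folds:
        clf = LinearSVC().fit(X[train], y[train])
        pred = clf.predict(X[test])
        spectrum_hits.extend(pred == y[test])          # each spectrum scored
        for animal in np.unique(animal_ids[test]):
            spectra = animal_ids[test] == animal
            majority = int(pred[spectra].sum() * 2 > spectra.sum())
            vote_hits.append(majority == y[test][spectra][0])
    return np.mean(spectrum_hits), np.mean(vote_hits)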

Scoring Metric             w/o Interleave    w/ Interleave
Mono-Isotopic
  Precision                0.77 ± 0.17       0.75 ± 0.18
  Recall                   0.81 ± 0.04       0.86 ± 0.05
  F-Score                  0.77 ± 0.12       0.77 ± 0.14
  False Positive Rate      0.05 ± 0.03       0.06 ± 0.04
Mono-Isotopic Fine
  Precision                0.78 ± 0.17       0.76 ± 0.18
  Recall                   0.83 ± 0.04       0.89 ± 0.04
  F-Score                  0.78 ± 0.12       0.79 ± 0.13
  False Positive Rate      0.21 ± 0.10       0.28 ± 0.12

Table 4.2: Statistical results of the dynamic programming algorithms with and without the interleaving algorithm (see Sections 3.3.3 and 3.6.1.3 for the distinction between the Mono-Isotopic and Mono-Isotopic Fine scores)

4.3.3 Results and Discussion

                            Per Spectrum                Majority Vote
Deisotope Method            A     R     P     FPR       A     R     P     FPR
All Features
  None                      0.77  0.83  0.73  0.29      0.83  0.90  0.78  0.24
  Annotated                 0.77  0.83  0.73  0.30      0.82  0.95  0.75  0.31
  Annotated Interleave      0.79  0.84  0.75  0.26      0.84  0.90  0.80  0.21
Top 100
  None                      0.62  0.61  0.61  0.37      0.59  0.55  0.58  0.38
  Annotated                 0.69  0.66  0.68  0.29      0.79  0.80  0.78  0.21
  Annotated Interleave      0.70  0.71  0.68  0.32      0.77  0.85  0.72  0.31
Top 10
  None                      0.61  0.74  0.58  0.52      0.63  0.80  0.59  0.52
  Annotated                 0.71  0.94  0.64  0.51      0.73  1.00  0.65  0.52
  Annotated Interleave      0.72  0.94  0.64  0.49      0.73  1.00  0.65  0.52

Table 4.3: Cross-validation accuracies for the Prion dataset (A = accuracy, R = recall, P = precision, FPR = false positive rate)

Table 4.2 shows the 10-fold cross-validation results for the isotope annotation algorithm, and Table 4.3 depicts the cross-validated accuracy results of the classifications. Even with the repeated samples, the low sample size severely limits what we can infer from the data analysis. However, a few observations can be made. The classification accuracies suggest that taking the majority vote after treating each spectrum as a separate sample gives the best accuracy results (Majority Vote columns in Table 4.3).

[Figures 4.2-4.9 are permutation-test histograms (Frequency versus Majority Vote Accuracy); plots not reproduced. Captions:]

Figure 4.2: Permutation Test of Undeisotoped Data Using All Features (P(X > 0.83) < 0.001)

Figure 4.3: Permutation Test of Undeisotoped Data Using Top 100 Info-Gain Features (P(X > 0.59) = 0.202)

Figure 4.4: Permutation Test of Undeisotoped Data Using Top 10 Info-Gain Features (P(X > 0.63) = 0.079)

Figure 4.5: Permutation Test of Deisotoped (w/o Interleave) Data Using All Features (P(X > 0.82) < 0.001)

Figure 4.6: Permutation Test of Deisotoped (w/o Interleave) Data Using Top 100 Info-Gain Features (P(X > 0.79) = 0.001)

Figure 4.7: Permutation Test of Deisotoped (w/o Interleave) Data Using Top 10 Info-Gain Features (P(X > 0.73) = 0.003)

Figure 4.8: Permutation Test of Deisotoped (w/ Interleave) Data Using All Features (P(X > 0.84) < 0.001)

Figure 4.9: Permutation Test of Deisotoped (w/ Interleave) Data Using Top 100 Info-Gain Features (P(X > 0.77) = 0.003)

Figure 4.10: Permutation Test of Deisotoped (w/ Interleave) Data Using Top 10 Info-Gain Features (P(X > 0.73) = 0.004); plot not reproduced.

Looking at the results in Table 4.3, the best method appears to be the interleave deisotoping method with all features and the majority vote, although the accuracies are too similar for this difference to be significant. However, when only the top 10 or top 100 peaks are used, the accuracies of the deisotoping methods are noticeably more stable. The ROC curves in Figures 4.20-4.22 show the same trend: the area under the curve for the non-deisotoped data decreases significantly more than for the data processed with the deisotoping methods. Furthermore, the average number of features generated from the raw peaks is 9,900, whereas the isotope and isotope-interleave methods yield 4,500 and 4,970 features, respectively, before selection.

Permutation tests are a tool for assessing whether a classifier's prediction accuracy exceeds chance when the number of samples is low compared to the feature count. The permutation tests in Figures 4.2-4.10 show that the models built using all available features, irrespective of deisotoping method, perform significantly better than random chance would predict. In this dissertation, we also use them to show the robustness of the features generated by the isotopic annotation methods. The estimated p-values are given in Table 4.4, and a sketch of the permutation procedure follows.
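In the sketch below, labels are shuffled at the animal level so that each hamster's nine spectra keep a common label; the evaluate callback is assumed to re-run the full cross-validation pipeline (including per-fold feature selection) and return majority-vote accuracy. All names are illustrative.

import numpy as np

def permutation_p_value(observed, X, y, animal_ids, evaluate, n_perm=1000):
    """Estimate P(X > observed accuracy) under label permutation."""
    animals = np.unique(animal_ids)
    labels = np.array([y[animal_ids == a][0] for a in animals])
    rng = np.random.default_rng(0)
    exceed = 0
    for _ in range(n_perm):
        # permute one label per animal, then broadcast to its spectra
        shuffled = dict(zip(animals, rng.permutation(labels)))
        y_perm = np.array([shuffled[a] for a in animal_ids])
        if evaluate(X, y_perm, animal_ids) > observed:
            exceed += 1
    return exceed / n_perm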

Deisotoping Method          Accuracy    P(X > Accuracy)
All Features
  None                      0.83        < 0.001
  Annotated                 0.82        < 0.001
  Annotated Interleave      0.84        < 0.001
Top 100
  None                      0.59        0.202
  Annotated                 0.79        0.001
  Annotated Interleave      0.77        0.003
Top 10
  None                      0.63        0.079
  Annotated                 0.73        0.003
  Annotated Interleave      0.73        0.004

Table 4.4: Permutation p-values for the Prion dataset

For all of the permutation tests, the resulting histograms look approximately normal. When the classification model is built using all of the features, the resulting p-values for all three deisotoping methods are significant at the