Book Details
Abstract
The field of proteomics has developed rapidly over the past decade, nurturing the need for a detailed introduction to the various informatics topics that underpin the main liquid chromatography tandem mass spectrometry (LC-MS/MS) protocols used for protein identification and quantitation. Proteins are a key component of any biological system, and monitoring proteins using LC-MS/MS proteomics is becoming commonplace in a wide range of biological research areas. However, many researchers treat proteomics software tools as a black box, drawing conclusions from the output of such tools without considering the nuances and limitations of the algorithms on which such software is based. This book seeks to address this situation by bringing together world experts to provide clear explanations of the key algorithms, workflows and analysis frameworks, so that users of proteomics data can be confident that they are using appropriate tools in suitable ways.
This is a timely book for the proteomics researcher in guiding decision trees involved in the informatics pipelines of this rapidly developing field. The book does an excellent job in focusing on that part of the informatics discipline, both at the user interface and behind the scenes algorithms, that is of great importance to developing innovative and effective methods for protein interrogation.
Taufika Islam Williams
Conrad Bessant is Professor of Bioinformatics at Queen Mary University of London. He has particular interests in proteomics, software development and machine learning and is striving to ensure that everyone using proteomics data can access the latest analysis methods and knows how to use them in the most effective way.
Table of Contents
Section Title | Page |
---|---|
Cover | Cover | ||
Proteome Informatics | i | ||
Acknowledgements | v | ||
Contents | vii | ||
Chapter 1 - Introduction to Proteome Informatics | 1 | ||
1.1 Introduction | 1 | ||
1.2 Principles of LC-MS/MS Proteomics | 3 | ||
1.2.1 Protein Fundamentals | 3 | ||
1.2.2 Shotgun Proteomics | 5 | ||
1.2.3 Separation of Peptides by Chromatography | 6 | ||
1.2.4 Mass Spectrometry | 6 | ||
1.3 Identification of Peptides and Proteins | 8 | ||
1.4 Protein Quantitation | 9 | ||
1.5 Applications and Downstream Analysis | 9 | ||
1.6 Proteomics Software | 10 | ||
1.6.1 Proteomics Data Standards and Databases | 11 | ||
1.7 Conclusions | 12 | ||
Acknowledgements | 12 | ||
References | 12 | ||
Section I - Protein Identification | 15 | ||
Chapter 2 - De novo Peptide Sequencing | 17 | ||
2.1 Introduction | 17 | ||
2.2 Manual De novo Sequencing | 18 | ||
2.3 Computer Algorithms | 20 | ||
2.3.1 Search Tree Pruning | 20 | ||
2.3.2 Spectrum Graph | 21 | ||
2.3.3 PEAKS Algorithm | 24 | ||
2.4 Scoring Function | 26 | ||
2.4.1 Likelihood Ratio | 27 | ||
2.4.2 Utilization of Many Ion Types | 28 | ||
2.4.3 Combined Use of Different Fragmentations | 28 | ||
2.4.4 Machine Learning | 29 | ||
2.4.5 Amino Acid Score | 30 | ||
2.5 Computer Software | 31 | ||
2.5.1 Lutefisk | 31 | ||
2.5.2 Sherenga | 31 | ||
2.5.3 PEAKS | 31 | ||
2.5.4 PepNovo | 32 | ||
2.5.5 DACSIM | 32 | ||
2.5.6 NovoHMM | 32 | ||
2.5.7 MSNovo | 32 | ||
2.5.8 PILOT | 32 | ||
2.5.9 pNovo | 33 | ||
2.5.10 Novor | 33 | ||
2.6 Conclusion: Applications and Limitations of De novo Sequencing | 33 | ||
2.6.1 Sequencing Novel Peptides and Detecting Mutated Peptides | 33 | ||
2.6.2 Assisting Database Search | 34 | ||
2.6.3 De novo Protein Sequencing | 34 | ||
2.6.4 Unspecified PTM Characterization | 34 | ||
2.6.5 Limitations | 35 | ||
Acknowledgements | 35 | ||
References | 36 | ||
Chapter 3 - Peptide Spectrum Matching via Database Search and Spectral Library Search | 39 | ||
3.1 Introduction | 39 | ||
3.2 Protein Sequence Databases | 41 | ||
3.3 Overview of Shotgun Proteomics Method | 43 | ||
3.4 Collision Induced Dissociation Fragments Peptides in Predictable Ways | 44 | ||
3.5 Overview of Database Searching | 45 | ||
3.6 MyriMatch Database Search Engine | 47 | ||
3.6.1 Spectrum Preparation | 48 | ||
3.6.2 Peptide Harvesting from Database | 49 | ||
3.6.3 Comparing Experimental MS/MS with Candidate Peptide Sequences | 49 | ||
3.7 Accounting for Post-Translational Modifications During Database Search | 52 | ||
3.8 Reporting of Database Search Peptide Identifications | 53 | ||
3.9 Spectral Library Search Concept | 55 | ||
3.10 Peptide Spectral Libraries | 56 | ||
3.11 Overview of Spectral Library Searching | 58 | ||
3.12 Pepitome Spectral Library Search Engine | 59 | ||
3.12.1 Experimental MS2 Spectrum Preparation | 60 | ||
3.12.2 Library Spectrum Harvesting and Spectrum–Spectrum Matching | 60 | ||
3.12.3 Results Reporting | 62 | ||
3.13 Search Results Vary Between Various Database Search Engines and Different Peptide Identification Search Strategies | 62 | ||
3.14 Conclusion | 63 | ||
References | 64 | ||
Chapter 4 - PSM Scoring and Validation | 69 | ||
4.1 Introduction | 69 | ||
4.2 Statistical Scores and What They Mean | 71 | ||
4.2.1 Statistical Probability p-Values and Multiple Testing | 72 | ||
4.2.2 Expectation Scores | 72 | ||
4.2.3 False Discovery Rates | 73 | ||
4.2.4 q-Values | 74 | ||
4.2.5 Posterior Error Probability | 75 | ||
4.2.6 Which Statistical Measure to Use and When | 75 | ||
4.2.7 Target Decoy Approaches for FDR Assessment | 77 | ||
4.3 Post-Search Validation Tools and Methods | 80 | ||
4.3.1 Qvality | 80 | ||
4.3.2 PeptideProphet | 81 | ||
4.3.3 Percolator | 81 | ||
4.3.4 Mass Spectrometry Generating Function | 82 | ||
4.3.5 Nokoi | 83 | ||
4.3.6 PepDistiller | 83 | ||
4.3.7 Integrated Workflow and Pipeline Analysis Tools | 83 | ||
4.3.8 Developer Libraries | 84 | ||
4.4 Common Pitfalls and Problems in Statistical Analysis of Proteomics Data | 84 | ||
4.4.1 Target-Decoy Peptide Assumptions | 84 | ||
4.4.2 Peptide Modifications | 85 | ||
4.4.3 Search Space Size | 86 | ||
4.4.4 Distinct Peptides and Proteins | 87 | ||
4.5 Conclusion and Future Trends | 88 | ||
References | 88 | ||
Chapter 5 - Protein Inference and Grouping | 93 | ||
5.1 Background | 93 | ||
5.1.1 Assignment of Peptides to Proteins | 95 | ||
5.1.2 Protein Groups and Families | 97 | ||
5.2 Theoretical Solutions and Protein Scoring | 100 | ||
5.2.1 Protein Grouping Based on Sets of Peptides | 100 | ||
5.2.2 Spectral-Focussed Inference Approaches | 102 | ||
5.2.3 Considerations of Protein Length | 104 | ||
5.2.4 Handling Sub-Set and Same-Set Proteins within Groups | 105 | ||
5.2.5 Assignment of Representative or Group Leader Proteins | 108 | ||
5.2.6 Importance of Peptide Classification to Quantitative Approaches | 108 | ||
5.2.7 Scoring or Probability Assignment at the Protein-Level | 109 | ||
5.2.8 Handling “One Hit Wonders” | 111 | ||
5.3 Support for Protein Grouping in Data Standards | 112 | ||
5.4 Conclusions | 113 | ||
Acknowledgements | 114 | ||
References | 114 | ||
Chapter 6 - Identification and Localization of Post-Translational Modifications by High-Resolution Mass Spectrometry | 116 | ||
6.1 Introduction | 116 | ||
6.2 Sample Preparation Challenges | 118 | ||
6.3 Identification and Localization of Post-Translational Modifications | 120 | ||
6.3.1 Computational Challenges | 120 | ||
6.3.2 Annotation of Modifications | 122 | ||
6.3.3 Common Post-Translational Modifications Identified by Mass Spectrometry | 123 | ||
6.3.4 Validation of Results | 124 | ||
6.4 Conclusion | 129 | ||
Acknowledgements | 129 | ||
References | 129 | ||
Section II - Protein Quantitation | 133 | ||
Chapter 7 - Algorithms for MS1-Based Quantitation | 135 | ||
7.1 Introduction | 135 | ||
7.2 Feature Detection and Quantitation | 137 | ||
7.2.1 Conventional Feature Detection | 138 | ||
7.2.2 Recent Approaches Based on Sparsity and Mixture Modelling | 140 | ||
7.3 Chromatogram Alignment | 142 | ||
7.3.1 Feature-Based Pattern Matching | 143 | ||
7.3.2 Raw Profile Alignment | 143 | ||
7.4 Abundance Normalisation | 146 | ||
7.5 Protein-Level Differential Quantification | 147 | ||
7.5.1 Statistical Methods | 148 | ||
7.5.2 Statistical Models Accounting for Shared Peptides | 151 | ||
7.6 Discussion | 151 | ||
Acknowledgements | 152 | ||
References | 152 | ||
Chapter 8 - MS2-Based Quantitation | 155 | ||
8.1 MS2-Based Quantification of Proteins | 155 | ||
8.2 Spectral Counting | 156 | ||
8.2.1 Implementations | 158 | ||
8.2.2 Conclusion on Spectrum Counting | 158 | ||
8.3 Reporter Ion-Based Quantification | 161 | ||
8.3.1 Identification | 164 | ||
8.3.2 Reporter Ion Intensities, Interferences and Deisotoping | 165 | ||
8.3.3 Ratio Estimation and Normalization | 168 | ||
8.3.4 Implementation | 169 | ||
8.3.5 Conclusion on Reporter Ion-Based Quantification | 173 | ||
Acknowledgements | 175 | ||
References | 175 | ||
Chapter 9 - Informatics Solutions for Selected Reaction Monitoring | 178 | ||
9.1 Introduction | 178 | ||
9.1.1 SRM – General Concept and Specific Bioinformatic Challenges | 178 | ||
9.1.2 SRM-Specific Bioinformatics Tools | 180 | ||
9.2 SRM Assay Development | 182 | ||
9.2.1 Target and Transition Selection, Proteotypic and Quantotypic Peptides | 182 | ||
9.2.2 Spikes of Isotopically Labeled Peptides and Protein Standards and Additional Assay Development Steps | 183 | ||
9.2.3 Retention Time Regressions and Retention Time Scheduling | 184 | ||
9.2.4 Method Generation for MS Acquisitions | 186 | ||
9.3 System Suitability Assessments | 188 | ||
9.4 Post-Acquisition Processing and Data Analysis | 188 | ||
9.4.1 mProphet False Discovery Analysis, Peak Detection and Peak Picking | 188 | ||
9.4.2 Data Viewing and Data Management: Custom Annotation, Results and Document Grids, Group Comparisons | 191 | ||
9.4.3 Data Reports, LOD–LOQ Calculations and Statistical Processing, Use of Skyline External Tools | 191 | ||
9.4.4 Group Comparisons and Peptide & Protein Quantification | 192 | ||
9.4.5 Easy Data Sharing and SRM Resources – Panorama | 193 | ||
9.5 Post-Translational Modifications and Protein Isoforms or Proteoforms | 193 | ||
9.6 Conclusion and Future Outlook | 195 | ||
Acknowledgements | 196 | ||
References | 196 | ||
Chapter 10 - Data Analysis for Data Independent Acquisition | 200 | ||
10.1 Analytical Methods | 200 | ||
10.1.1 Motivation | 200 | ||
10.1.2 Background: Other MS Methods | 201 | ||
10.1.3 DIA Concept | 202 | ||
10.1.4 Theoretical Considerations | 204 | ||
10.1.5 Main DIA Methods | 207 | ||
10.1.5.1 PRM | 207 | ||
10.1.5.2 MSE/HDMSE/AIF | 209 | ||
10.1.5.3 PAcIFIC | 209 | ||
10.1.5.4 SWATH-MS | 210 | ||
10.1.5.5 MSX | 210 | ||
10.1.6 Analyte Separation Methods | 210 | ||
10.2 Data Analysis Methods | 212 | ||
10.2.1 DIA Data Analysis | 212 | ||
10.2.2 Untargeted Analysis, Spectrum-Centric | 213 | ||
10.2.2.1 Signal Clustering | 213 | ||
10.2.2.2 Pseudo-Spectra Identification | 214 | ||
10.2.2.3 Peptide and Protein Quantification | 215 | ||
10.2.3 Targeted Analysis, Chromatogram-Centric | 215 | ||
10.2.3.1 Retention Time Normalisation | 216 | ||
10.2.3.2 Chromatogram Extraction | 217 | ||
10.2.3.3 Peak Group Scoring | 218 | ||
10.2.3.3.1 Peak Picking | 218 | ||
10.2.3.3.2 Peak Scoring | 218 | ||
10.2.3.4 Peak Quantification | 219 | ||
10.2.3.5 Error Rate Estimation | 219 | ||
10.2.3.6 Alignment | 219 | ||
10.2.4 FDR | 220 | ||
10.2.5 Results and Formats | 222 | ||
10.3 Challenges | 223 | ||
References | 224 | ||
Section III - Open Source Software Environments for Proteome Informatics | 229 | ||
Chapter 11 - Data Formats of the Proteomics Standards Initiative | 231 | ||
11.1 Introduction | 231 | ||
11.2 mzML | 233 | ||
11.2.1 Data Format | 233 | ||
11.2.2 Software Implementations | 235 | ||
11.2.3 Current Work | 237 | ||
11.2.4 Variations of mzML | 237 | ||
11.3 mzIdentML | 238 | ||
11.3.1 Data Format | 238 | ||
11.3.2 Software Implementations | 241 | ||
11.3.3 Current Work | 242 | ||
11.4 mzQuantML | 242 | ||
11.4.1 Data Format | 242 | ||
11.4.2 Software Implementations | 245 | ||
11.4.3 Current Work | 245 | ||
11.5 mzTab | 246 | ||
11.5.1 Data Format | 246 | ||
11.5.2 Software Implementations | 248 | ||
11.5.3 Current Work | 249 | ||
11.6 TraML | 249 | ||
11.6.1 Data Format | 249 | ||
11.6.2 Software Implementations | 251 | ||
11.7 Other Data Standard Formats Produced by the PSI | 251 | ||
11.8 Conclusions | 252 | ||
Abbreviations | 252 | ||
Acknowledgements | 253 | ||
References | 253 | ||
Chapter 12 - OpenMS: A Modular, Open-Source Workflow System for the Analysis of Quantitative Proteomics Data | 259 | ||
12.1 Introduction | 259 | ||
12.2 Peptide Identification | 262 | ||
12.3 iTRAQ Labeling | 266 | ||
12.4 Dimethyl Labeling | 270 | ||
12.5 Label-Free Quantification | 275 | ||
12.6 Conclusion | 279 | ||
Acknowledgements | 282 | ||
References | 282 | ||
Chapter 13 - Using Galaxy for Proteomics | 289 | ||
13.1 Introduction | 289 | ||
13.2 The Galaxy Framework as a Solution for MS-Based Proteomic Informatics | 291 | ||
13.2.1 The Web-Based User Interface | 291 | ||
13.2.2 Galaxy Histories | 293 | ||
13.2.3 Galaxy Workflows | 293 | ||
13.2.4 Sharing Histories and Workflows in Galaxy | 296 | ||
13.3 Extending Galaxy for New Data Analysis Applications | 296 | ||
13.3.1 Deploying Software as a Galaxy Tool | 296 | ||
13.3.2 Galaxy Plugins and Visualization | 299 | ||
13.4 Publishing Galaxy Extensions | 300 | ||
13.5 Scaling Galaxy for Operation on High Performance Systems | 300 | ||
13.6 Windows-Only Applications in a Linux World | 301 | ||
13.7 MS-Based Proteomic Applications in Galaxy | 302 | ||
13.7.1 Raw Data Conversion and Pre-Processing | 302 | ||
13.7.2 Generation of a Reference Protein Sequence Database | 304 | ||
13.7.3 Sequence Database Searching | 304 | ||
13.7.4 Results Filtering and Visualization | 305 | ||
13.8 Integrating the ‘-omic’ Domains: Multi-Omic Applications in Galaxy | 306 | ||
13.8.1 Building Proteogenomic Workflows in Galaxy | 309 | ||
13.8.2 Metaproteomics Applications in Galaxy | 313 | ||
13.9 Concluding Thoughts and Future Directions | 315 | ||
Acknowledgements | 317 | ||
References | 317 | ||
Chapter 14 - R for Proteomics | 321 | ||
14.1 Introduction | 321 | ||
14.2 Accessing Data | 323 | ||
14.2.1 Data Packages | 323 | ||
14.2.2 Data from the ProteomeXchange Repository | 324 | ||
14.2.3 Cloud Infrastructure | 325 | ||
14.3 Reading and Handling Mass Spectrometry and Proteomics Data | 326 | ||
14.3.1 Raw Data | 326 | ||
14.3.2 Identification Data | 327 | ||
14.3.3 Quantitative Data | 329 | ||
14.3.4 Imaging Data | 330 | ||
14.3.5 Conclusion | 330 | ||
14.4 MSMS Identifications | 330 | ||
14.4.1 Introduction | 330 | ||
14.4.2 The MSGFplus Package | 331 | ||
14.4.3 The MSGFgui Package | 332 | ||
14.4.4 The rTANDEM Package | 334 | ||
14.4.5 The MSnID Package | 335 | ||
14.4.6 Example | 338 | ||
14.5 Analysis of Spectral Counting Data | 339 | ||
14.5.1 Introduction | 339 | ||
14.5.2 Exploratory Data Analysis with msmsEDA | 339 | ||
14.5.3 Statistical Analyses with msmsTests | 341 | ||
14.5.4 Example | 342 | ||
14.6 MALDI and Mass Spectrometry Imaging | 342 | ||
14.6.1 Introduction | 342 | ||
14.6.2 MALDI Pre-Processing Using MALDIquant | 343 | ||
14.6.2.1 Import Raw Data | 343 | ||
14.6.2.2 Intensity Transformation and Smoothing | 345 | ||
14.6.2.3 Baseline Correction | 345 | ||
14.6.2.4 Intensity Calibration | 346 | ||
14.6.2.5 Peak Detection | 347 | ||
14.6.2.6 Peak Alignment | 347 | ||
14.6.2.7 Peak Binning | 347 | ||
14.6.2.8 Conclusion | 348 | ||
14.6.3 Mass Spectrometry Imaging | 348 | ||
14.6.3.1 Cardinal | 348 | ||
14.6.3.2 MALDIquant | 350 | ||
14.6.3.3 Conclusion | 350 | ||
14.7 Isobaric Tagging and Quantitative Data Processing | 350 | ||
14.7.1 Quantification of Isobaric Data Experiments | 351 | ||
14.7.2 Processing Quantitative Proteomics Data | 351 | ||
14.8 Machine Learning, Statistics and Applications | 352 | ||
14.8.1 Introduction | 352 | ||
14.8.2 Statistics | 352 | ||
14.8.3 Machine Learning | 354 | ||
14.8.3.1 Supervised Machine Learning | 355 | ||
14.8.3.2 Unsupervised Machine Learning | 358 | ||
14.8.4 Conclusion | 358 | ||
14.9 Conclusions | 359 | ||
References | 359 | ||
Section IV - Integration of Proteomics and Other Data | 365 | ||
Chapter 15 - Proteogenomics: Proteomics for Genome Annotation | 367 | ||
15.1 Introduction | 367 | ||
15.2 Theoretical Underpinning | 370 | ||
15.2.1 Gene Prediction | 371 | ||
15.2.1.1 Ab initio Gene Prediction | 371 | ||
15.2.1.2 Evidence-Based Gene Prediction | 371 | ||
15.2.2 Protein and Peptide Identification | 372 | ||
15.2.3 Design of Protein Sequence Databases | 372 | ||
15.2.4 Output of Proteogenomics Pipelines | 375 | ||
15.2.4.1 Statistics and False Discovery Rate Calculation | 376 | ||
15.3 Proteogenomics Platforms | 377 | ||
15.3.1 Gene Prediction Pipelines | 377 | ||
15.3.2 Proteogenomics Pipelines | 378 | ||
15.3.3 Proteomics Data Repositories for Proteogenomics | 378 | ||
15.3.4 Visualisation | 379 | ||
15.3.5 Data Formats and Standards | 380 | ||
15.4 Challenges and Future Research | 381 | ||
15.5 Summary | 381 | ||
References | 382 | ||
Chapter 16 - Proteomics Informed by Transcriptomics | 385 | ||
16.1 Introduction to PIT | 385 | ||
16.2 Creation of Protein Database from RNA-Seq Data | 388 | ||
16.2.1 Introduction to RNA-Seq | 388 | ||
16.2.2 Sequence Assembly | 391 | ||
16.2.3 ORF Finding | 392 | ||
16.2.4 Finalising Protein Sequence Data for PIT Search | 393 | ||
16.3 Interpretation of Identified ORFs | 393 | ||
16.3.1 Identification of Proteins in the Absence of a Reference Genome | 394 | ||
16.3.2 Identification of Individual Sequence Variation | 394 | ||
16.3.3 Monitoring Isoform Switching | 397 | ||
16.3.4 Genome Annotation and Discovery of Novel Translated Genomic Elements | 400 | ||
16.4 Reporting and Storing PIT Results | 400 | ||
16.5 Applications of PIT | 401 | ||
16.6 Conclusions | 402 | ||
Acknowledgements | 402 | ||
References | 402 | ||
Subject Index | 406 |