Proteome Informatics

Conrad Bessant

(2016)

Additional Information

Book Details

ISBN: 978-1-78262-673-2
Edition: 1
Language: English
Pages: 412
Subjects: Analytical chemistry; Spectrum analysis, spectrochemistry, mass spectrometry; Biology, life sciences; Biochemistry; Chemical biology

Abstract

The field of proteomics has developed rapidly over the past decade, creating the need for a detailed introduction to the various informatics topics that underpin the main liquid chromatography tandem mass spectrometry (LC-MS/MS) protocols used for protein identification and quantitation. Proteins are a key component of any biological system, and monitoring proteins using LC-MS/MS proteomics is becoming commonplace in a wide range of biological research areas. However, many researchers treat proteomics software tools as a black box, drawing conclusions from the output of such tools without considering the nuances and limitations of the algorithms on which such software is based. This book seeks to address this situation by bringing together world experts to provide clear explanations of the key algorithms, workflows and analysis frameworks, so that users of proteomics data can be confident that they are using appropriate tools in suitable ways.
"This is a timely book for the proteomics researcher in guiding decision trees involved in the informatics pipelines of this rapidly developing field. The book does an excellent job in focusing on that part of the informatics discipline, both at the user interface and behind the scenes algorithms, that is of great importance to developing innovative and effective methods for protein interrogation."
Taufika Islam Williams
Conrad Bessant is Professor of Bioinformatics at Queen Mary University of London. He has particular interests in proteomics, software development and machine learning and is striving to ensure that everyone using proteomics data can access the latest analysis methods and knows how to use them in the most effective way.

Section Title	Page
Cover	Cover
Proteome Informatics	i
Acknowledgements	v
Contents	vii
Chapter 1 - Introduction to Proteome Informatics	1
1.1 Introduction	1
1.2 Principles of LC-MS/MS Proteomics	3
1.2.1 Protein Fundamentals	3
1.2.2 Shotgun Proteomics	5
1.2.3 Separation of Peptides by Chromatography	6
1.2.4 Mass Spectrometry	6
1.3 Identification of Peptides and Proteins	8
1.4 Protein Quantitation	9
1.5 Applications and Downstream Analysis	9
1.6 Proteomics Software	10
1.6.1 Proteomics Data Standards and Databases	11
1.7 Conclusions	12
Acknowledgements	12
References	12
Section I - Protein Identification	15
Chapter 2 - De novo Peptide Sequencing	17
2.1 Introduction	17
2.2 Manual De novo Sequencing	18
2.3 Computer Algorithms	20
2.3.1 Search Tree Pruning	20
2.3.2 Spectrum Graph	21
2.3.3 PEAKS Algorithm	24
2.4 Scoring Function	26
2.4.1 Likelihood Ratio	27
2.4.2 Utilization of Many Ion Types	28
2.4.3 Combined Use of Different Fragmentations	28
2.4.4 Machine Learning	29
2.4.5 Amino Acid Score	30
2.5 Computer Software	31
2.5.1 Lutefisk	31
2.5.2 Sherenga	31
2.5.3 PEAKS	31
2.5.4 PepNovo	32
2.5.5 DACSIM	32
2.5.6 NovoHMM	32
2.5.7 MSNovo	32
2.5.8 PILOT	32
2.5.9 pNovo	33
2.5.10 Novor	33
2.6 Conclusion: Applications and Limitations of De novo Sequencing	33
2.6.1 Sequencing Novel Peptides and Detecting Mutated Peptides	33
2.6.2 Assisting Database Search	34
2.6.3 De novo Protein Sequencing	34
2.6.4 Unspecified PTM Characterization	34
2.6.5 Limitations	35
Acknowledgements	35
References	36
Chapter 3 - Peptide Spectrum Matching via Database Search and Spectral Library Search	39
3.1 Introduction	39
3.2 Protein Sequence Databases	41
3.3 Overview of Shotgun Proteomics Method	43
3.4 Collision Induced Dissociation Fragments Peptides in Predictable Ways	44
3.5 Overview of Database Searching	45
3.6 MyriMatch Database Search Engine	47
3.6.1 Spectrum Preparation	48
3.6.2 Peptide Harvesting from Database	49
3.6.3 Comparing Experimental MS/MS with Candidate Peptide Sequences	49
3.7 Accounting for Post-Translational Modifications During Database Search	52
3.8 Reporting of Database Search Peptide Identifications	53
3.9 Spectral Library Search Concept	55
3.10 Peptide Spectral Libraries	56
3.11 Overview of Spectral Library Searching	58
3.12 Pepitome Spectral Library Search Engine	59
3.12.1 Experimental MS2 Spectrum Preparation	60
3.12.2 Library Spectrum Harvesting and Spectrum–Spectrum Matching	60
3.12.3 Results Reporting	62
3.13 Search Results Vary Between Various Database Search Engines and Different Peptide Identification Search Strategies	62
3.14 Conclusion	63
References	64
Chapter 4 - PSM Scoring and Validation	69
4.1 Introduction	69
4.2 Statistical Scores and What They Mean	71
4.2.1 Statistical Probability p-Values and Multiple Testing	72
4.2.2 Expectation Scores	72
4.2.3 False Discovery Rates	73
4.2.4 q-Values	74
4.2.5 Posterior Error Probability	75
4.2.6 Which Statistical Measure to Use and When	75
4.2.7 Target Decoy Approaches for FDR Assessment	77
4.3 Post-Search Validation Tools and Methods	80
4.3.1 Qvality	80
4.3.2 PeptideProphet	81
4.3.3 Percolator	81
4.3.4 Mass Spectrometry Generating Function	82
4.3.5 Nokoi	83
4.3.6 PepDistiller	83
4.3.7 Integrated Workflow and Pipeline Analysis Tools	83
4.3.8 Developer Libraries	84
4.4 Common Pitfalls and Problems in Statistical Analysis of Proteomics Data	84
4.4.1 Target-Decoy Peptide Assumptions	84
4.4.2 Peptide Modifications	85
4.4.3 Search Space Size	86
4.4.4 Distinct Peptides and Proteins	87
4.5 Conclusion and Future Trends	88
References	88
Chapter 5 - Protein Inference and Grouping	93
5.1 Background	93
5.1.1 Assignment of Peptides to Proteins	95
5.1.2 Protein Groups and Families	97
5.2 Theoretical Solutions and Protein Scoring	100
5.2.1 Protein Grouping Based on Sets of Peptides	100
5.2.2 Spectral-Focussed Inference Approaches	102
5.2.3 Considerations of Protein Length	104
5.2.4 Handling Sub-Set and Same-Set Proteins within Groups	105
5.2.5 Assignment of Representative or Group Leader Proteins	108
5.2.6 Importance of Peptide Classification to Quantitative Approaches	108
5.2.7 Scoring or Probability Assignment at the Protein-Level	109
5.2.8 Handling “One Hit Wonders”	111
5.3 Support for Protein Grouping in Data Standards	112
5.4 Conclusions	113
Acknowledgements	114
References	114
Chapter 6 - Identification and Localization of Post-Translational Modifications by High-Resolution Mass Spectrometry	116
6.1 Introduction	116
6.2 Sample Preparation Challenges	118
6.3 Identification and Localization of Post-Translational Modifications	120
6.3.1 Computational Challenges	120
6.3.2 Annotation of Modifications	122
6.3.3 Common Post-Translational Modifications Identified by Mass Spectrometry	123
6.3.4 Validation of Results	124
6.4 Conclusion	129
Acknowledgements	129
References	129
Section II - Protein Quantitation	133
Chapter 7 - Algorithms for MS1-Based Quantitation	135
7.1 Introduction	135
7.2 Feature Detection and Quantitation	137
7.2.1 Conventional Feature Detection	138
7.2.2 Recent Approaches Based on Sparsity and Mixture Modelling	140
7.3 Chromatogram Alignment	142
7.3.1 Feature-Based Pattern Matching	143
7.3.2 Raw Profile Alignment	143
7.4 Abundance Normalisation	146
7.5 Protein-Level Differential Quantification	147
7.5.1 Statistical Methods	148
7.5.2 Statistical Models Accounting for Shared Peptides	151
7.6 Discussion	151
Acknowledgements	152
References	152
Chapter 8 - MS2-Based Quantitation	155
8.1 MS2-Based Quantification of Proteins	155
8.2 Spectral Counting	156
8.2.1 Implementations	158
8.2.2 Conclusion on Spectrum Counting	158
8.3 Reporter Ion-Based Quantification	161
8.3.1 Identification	164
8.3.2 Reporter Ion Intensities, Interferences and Deisotoping	165
8.3.3 Ratio Estimation and Normalization	168
8.3.4 Implementation	169
8.3.5 Conclusion on Reporter Ion-Based Quantification	173
Acknowledgements	175
References	175
Chapter 9 - Informatics Solutions for Selected Reaction Monitoring	178
9.1 Introduction	178
9.1.1 SRM – General Concept and Specific Bioinformatic Challenges	178
9.1.2 SRM-Specific Bioinformatics Tools	180
9.2 SRM Assay Development	182
9.2.1 Target and Transition Selection, Proteotypic and Quantotypic Peptides	182
9.2.2 Spikes of Isotopically Labeled Peptides and Protein Standards and Additional Assay Development Steps	183
9.2.3 Retention Time Regressions and Retention Time Scheduling	184
9.2.4 Method Generation for MS Acquisitions	186
9.3 System Suitability Assessments	188
9.4 Post-Acquisition Processing and Data Analysis	188
9.4.1 mProphet False Discovery Analysis, Peak Detection and Peak Picking	188
9.4.2 Data Viewing and Data Management: Custom Annotation, Results and Document Grids, Group Comparisons	191
9.4.3 Data Reports, LOD–LOQ Calculations and Statistical Processing, Use of Skyline External Tools	191
9.4.4 Group Comparisons and Peptide & Protein Quantification	192
9.4.5 Easy Data Sharing and SRM Resources – Panorama	193
9.5 Post-Translational Modifications and Protein Isoforms or Proteoforms	193
9.6 Conclusion and Future Outlook	195
Acknowledgements	196
References	196
Chapter 10 - Data Analysis for Data Independent Acquisition	200
10.1 Analytical Methods	200
10.1.1 Motivation	200
10.1.2 Background: Other MS Methods	201
10.1.3 DIA Concept	202
10.1.4 Theoretical Considerations	204
10.1.5 Main DIA Methods	207
10.1.5.1 PRM	207
10.1.5.2 MSE/HDMSE/AIF	209
10.1.5.3 PAcIFIC	209
10.1.5.4 SWATH-MS	210
10.1.5.5 MSX	210
10.1.6 Analyte Separation Methods	210
10.2 Data Analysis Methods	212
10.2.1 DIA Data Analysis	212
10.2.2 Untargeted Analysis, Spectrum-Centric	213
10.2.2.1 Signal Clustering	213
10.2.2.2 Pseudo-Spectra Identification	214
10.2.2.3 Peptide and Protein Quantification	215
10.2.3 Targeted Analysis, Chromatogram-Centric	215
10.2.3.1 Retention Time Normalisation	216
10.2.3.2 Chromatogram Extraction	217
10.2.3.3 Peak Group Scoring	218
10.2.3.3.1 Peak Picking	218
10.2.3.3.2 Peak Scoring	218
10.2.3.4 Peak Quantification	219
10.2.3.5 Error Rate Estimation	219
10.2.3.6 Alignment	219
10.2.4 FDR	220
10.2.5 Results and Formats	222
10.3 Challenges	223
References	224
Section III - Open Source Software Environments for Proteome Informatics	229
Chapter 11 - Data Formats of the Proteomics Standards Initiative	231
11.1 Introduction	231
11.2 mzML	233
11.2.1 Data Format	233
11.2.2 Software Implementations	235
11.2.3 Current Work	237
11.2.4 Variations of mzML	237
11.3 mzIdentML	238
11.3.1 Data Format	238
11.3.2 Software Implementations	241
11.3.3 Current Work	242
11.4 mzQuantML	242
11.4.1 Data Format	242
11.4.2 Software Implementations	245
11.4.3 Current Work	245
11.5 mzTab	246
11.5.1 Data Format	246
11.5.2 Software Implementations	248
11.5.3 Current Work	249
11.6 TraML	249
11.6.1 Data Format	249
11.6.2 Software Implementations	251
11.7 Other Data Standard Formats Produced by the PSI	251
11.8 Conclusions	252
Abbreviations	252
Acknowledgements	253
References	253
Chapter 12 - OpenMS: A Modular, Open-Source Workflow System for the Analysis of Quantitative Proteomics Data	259
12.1 Introduction	259
12.2 Peptide Identification	262
12.3 iTRAQ Labeling	266
12.4 Dimethyl Labeling	270
12.5 Label-Free Quantification	275
12.6 Conclusion	279
Acknowledgements	282
References	282
Chapter 13 - Using Galaxy for Proteomics	289
13.1 Introduction	289
13.2 The Galaxy Framework as a Solution for MS-Based Proteomic Informatics	291
13.2.1 The Web-Based User Interface	291
13.2.2 Galaxy Histories	293
13.2.3 Galaxy Workflows	293
13.2.4 Sharing Histories and Workflows in Galaxy	296
13.3 Extending Galaxy for New Data Analysis Applications	296
13.3.1 Deploying Software as a Galaxy Tool	296
13.3.2 Galaxy Plugins and Visualization	299
13.4 Publishing Galaxy Extensions	300
13.5 Scaling Galaxy for Operation on High Performance Systems	300
13.6 Windows-Only Applications in a Linux World	301
13.7 MS-Based Proteomic Applications in Galaxy	302
13.7.1 Raw Data Conversion and Pre-Processing	302
13.7.2 Generation of a Reference Protein Sequence Database	304
13.7.3 Sequence Database Searching	304
13.7.4 Results Filtering and Visualization	305
13.8 Integrating the ‘-omic’ Domains: Multi-Omic Applications in Galaxy	306
13.8.1 Building Proteogenomic Workflows in Galaxy	309
13.8.2 Metaproteomics Applications in Galaxy	313
13.9 Concluding Thoughts and Future Directions	315
Acknowledgements	317
References	317
Chapter 14 - R for Proteomics	321
14.1 Introduction	321
14.2 Accessing Data	323
14.2.1 Data Packages	323
14.2.2 Data from the ProteomeXchange Repository	324
14.2.3 Cloud Infrastructure	325
14.3 Reading and Handling Mass Spectrometry and Proteomics Data	326
14.3.1 Raw Data	326
14.3.2 Identification Data	327
14.3.3 Quantitative Data	329
14.3.4 Imaging Data	330
14.3.5 Conclusion	330
14.4 MSMS Identifications	330
14.4.1 Introduction	330
14.4.2 The MSGFplus Package	331
14.4.3 The MSGFgui Package	332
14.4.4 The rTANDEM Package	334
14.4.5 The MSnID Package	335
14.4.6 Example	338
14.5 Analysis of Spectral Counting Data	339
14.5.1 Introduction	339
14.5.2 Exploratory Data Analysis with msmsEDA	339
14.5.3 Statistical Analyses with msmsTests	341
14.5.4 Example	342
14.6 MALDI and Mass Spectrometry Imaging	342
14.6.1 Introduction	342
14.6.2 MALDI Pre-Processing Using MALDIquant	343
14.6.2.1 Import Raw Data	343
14.6.2.2 Intensity Transformation and Smoothing	345
14.6.2.3 Baseline Correction	345
14.6.2.4 Intensity Calibration	346
14.6.2.5 Peak Detection	347
14.6.2.6 Peak Alignment	347
14.6.2.7 Peak Binning	347
14.6.2.8 Conclusion	348
14.6.3 Mass Spectrometry Imaging	348
14.6.3.1 Cardinal	348
14.6.3.2 MALDIquant	350
14.6.3.3 Conclusion	350
14.7 Isobaric Tagging and Quantitative Data Processing	350
14.7.1 Quantification of Isobaric Data Experiments	351
14.7.2 Processing Quantitative Proteomics Data	351
14.8 Machine Learning, Statistics and Applications	352
14.8.1 Introduction	352
14.8.2 Statistics	352
14.8.3 Machine Learning	354
14.8.3.1 Supervised Machine Learning	355
14.8.3.2 Unsupervised Machine Learning	358
14.8.4 Conclusion	358
14.9 Conclusions	359
References	359
Section IV - Integration of Proteomics and Other Data	365
Chapter 15 - Proteogenomics: Proteomics for Genome Annotation	367
15.1 Introduction	367
15.2 Theoretical Underpinning	370
15.2.1 Gene Prediction	371
15.2.1.1 Ab initio Gene Prediction	371
15.2.1.2 Evidence-Based Gene Prediction	371
15.2.2 Protein and Peptide Identification	372
15.2.3 Design of Protein Sequence Databases	372
15.2.4 Output of Proteogenomics Pipelines	375
15.2.4.1 Statistics and False Discovery Rate Calculation	376
15.3 Proteogenomics Platforms	377
15.3.1 Gene Prediction Pipelines	377
15.3.2 Proteogenomics Pipelines	378
15.3.3 Proteomics Data Repositories for Proteogenomics	378
15.3.4 Visualisation	379
15.3.5 Data Formats and Standards	380
15.4 Challenges and Future Research	381
15.5 Summary	381
References	382
Chapter 16 - Proteomics Informed by Transcriptomics	385
16.1 Introduction to PIT	385
16.2 Creation of Protein Database from RNA-Seq Data	388
16.2.1 Introduction to RNA-Seq	388
16.2.2 Sequence Assembly	391
16.2.3 ORF Finding	392
16.2.4 Finalising Protein Sequence Data for PIT Search	393
16.3 Interpretation of Identified ORFs	393
16.3.1 Identification of Proteins in the Absence of a Reference Genome	394
16.3.2 Identification of Individual Sequence Variation	394
16.3.3 Monitoring Isoform Switching	397
16.3.4 Genome Annotation and Discovery of Novel Translated Genomic Elements	400
16.4 Reporting and Storing PIT Results	400
16.5 Applications of PIT	401
16.6 Conclusions	402
Acknowledgements	402
References	402
Subject Index	406
