Proteogenomics 101: a primer on database search strategies

Springer Science and Business Media LLC - Volume 14, Issue 4 - Pages 287-301
Anurag Raj1, Suruchi Aggarwal2, Dhirendra Kumar1, Amit Kumar Yadav2, Debasis Dash3
1G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, 110025, India
2Computational and Mathematical Biology Centre (CMBC), Translational Health Science and Technology Institute (THSTI), NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana, 121001, India
3Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India

References

Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207. https://doi.org/10.1038/nature01511

Aggarwal S, Yadav AK (2016) False discovery rate estimation in proteomics. Methods Mol Biol 1362:119–128. https://doi.org/10.1007/978-1-4939-3106-4_7

Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK (2022) False discovery rate: the Achilles’ heel of proteogenomics. Brief Bioinform. https://doi.org/10.1093/bib/bbac163

Aggarwal S, Gupta P, Dhawan U, Yadav AK (2023) Chapter 8—The language of posttranslational modifications and deciphering it from proteomics data. In: Garg M, Sethi G, Pandey AK (eds) Transcription and translation in health and disease. Academic Press, pp 109–136

Armengaud J (2010) Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics 7(1):65–77. https://doi.org/10.1586/epr.09.104

Askenazi M, Ruggles KV, Fenyo D (2016) PGx: putting peptides to BED. J Proteome Res 15(3):795–799. https://doi.org/10.1021/acs.jproteome.5b00870

Babele P, Yadav AK (2023) Back2Basics: mass-to-charge ratio (m/z) in proteomics. J Proteins Proteomics. https://doi.org/10.1007/s42485-023-00115-7

Barsnes H, Vaudel M (2018) SearchGUI: a highly adaptable common interface for proteomics search and de novo engines. J Proteome Res 17(7):2552–2555. https://doi.org/10.1021/acs.jproteome.8b00175

Bern MW, Kil YJ (2011) Two-dimensional target decoy strategy for shotgun proteomics. J Proteome Res 10(12):5296–5301. https://doi.org/10.1021/pr200780j

Binz PA, Shofstahl J, Vizcaino JA, Barsnes H, Chalkley RJ, Menschaert G et al (2019) Proteomics standards initiative extended FASTA format. J Proteome Res 18(6):2686–2692. https://doi.org/10.1021/acs.jproteome.9b00064

Bitton DA, Smith DL, Connolly Y, Scutt PJ, Miller CJ (2010) An integrated mass-spectrometry pipeline identifies novel protein coding-regions in the human genome. PLoS ONE 5(1):e8949. https://doi.org/10.1371/journal.pone.0008949

Blakeley P, Overton IM, Hubbard SJ (2012) Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J Proteome Res 11(11):5221–5234. https://doi.org/10.1021/pr300411q

Branca RM, Orre LM, Johansson HJ, Granholm V, Huss M, Perez-Bercoff A et al (2014) HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat Methods 11(1):59–62. https://doi.org/10.1038/nmeth.2732

Cao X, Xing J (2021) PrecisionProDB: improving the proteomics performance for precision medicine. Bioinformatics 37(19):3361–3363. https://doi.org/10.1093/bioinformatics/btab218

Cao R, Shi Y, Chen S, Ma Y, Chen J, Yang J et al (2017) dbSAP: single amino-acid polymorphism database for protein variation detection. Nucleic Acids Res 45(D1):D827–D832. https://doi.org/10.1093/nar/gkw1096

Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP (2008) Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci USA 105(52):21034–21038. https://doi.org/10.1073/pnas.0811066106

Castellana NE, Pham V, Arnott D, Lill JR, Bafna V (2010) Template proteogenomics: sequencing whole proteins using an imperfect database. Mol Cell Proteomics 9(6):1260–1270. https://doi.org/10.1074/mcp.M900504-MCP200

Cesnik AJ, Miller RM, Ibrahim K, Lu L, Millikin RJ, Shortreed MR et al (2021) Spritz: a proteogenomic database engine. J Proteome Res 20(4):1826–1834. https://doi.org/10.1021/acs.jproteome.0c00407

Chen YJ, Roumeliotis TI, Chang YH, Chen CT, Han CL, Lin MH et al (2020) Proteogenomics of non-smoking lung cancer in East Asia delineates molecular signatures of pathogenesis and progression. Cell 182(1):226–44.e17. https://doi.org/10.1016/j.cell.2020.06.012

Choi S, Kim H, Paek E (2017) ACTG: novel peptide mapping onto gene models. Bioinformatics 33(8):1218–1220. https://doi.org/10.1093/bioinformatics/btw787

GTEx Consortium (2013) The Genotype-Tissue Expression (GTEx) project. Nat Genet 45(6):580–585. https://doi.org/10.1038/ng.2653

Cradick TJ, Qiu P, Lee CM, Fine EJ, Bao G (2014) COSMID: a web-based tool for identifying and validating CRISPR/Cas off-target sites. Mol Ther Nucleic Acids 3:e214. https://doi.org/10.1038/mtna.2014.64

Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20(9):1466–1467. https://doi.org/10.1093/bioinformatics/bth092

Crappe J, Ndah E, Koch A, Steyaert S, Gawron D, De Keulenaer S et al (2015) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res 43(5):e29. https://doi.org/10.1093/nar/gku1283

Da Cunha LM, Terrematte P, Fiuza TDS, Silva VLD, Kroll JE, De Souza SJ et al (2022) dbPepVar: a novel cancer proteogenomics database. IEEE Access 10:90982–90994. https://doi.org/10.1109/access.2022.3201897

Dutta S, Ghosh S, Mishra A, Ghosh R (2023) Oncoproteomics: insight into current proteomic technologies in cancer biomarker discovery and treatment. J Proteins Proteomics 14(1):1–24. https://doi.org/10.1007/s42485-022-00100-6

Elias JE, Gygi SP (2007) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 4(3):207–214. https://doi.org/10.1038/nmeth1019

Eng JK, McCormack AL, Yates JR (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5(11):976–989. https://doi.org/10.1016/1044-0305(94)80016-2

Everett LJ, Bierl C, Master SR (2010) Unbiased statistical analysis for multi-stage proteomic search strategies. J Proteome Res 9(2):700–707. https://doi.org/10.1021/pr900256v

Fermin D, Allen BB, Blackwell TW, Menon R, Adamski M, Xu Y et al (2006) Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol 7(4):R35. https://doi.org/10.1186/gb-2006-7-4-r35

Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM et al (2021) Gencode 2021. Nucleic Acids Res 49(D1):D916–D923. https://doi.org/10.1093/nar/gkaa1087

Fu Y, Qian X (2014) Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. Mol Cell Proteomics (MCP) 13(5):1359–1368. https://doi.org/10.1074/mcp.O113.030189

Gallien S, Perrodou E, Carapito C, Deshayes C, Reyrat JM, Van Dorsselaer A et al (2009) Ortho-proteogenomics: multiple proteomes investigation through orthology and a new MS-based protocol. Genome Res 19(1):128–135. https://doi.org/10.1101/gr.081901.108

Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM et al (2004) Open mass spectrometry search algorithm. J Proteome Res 3(5):958–964. https://doi.org/10.1021/pr0499491

Ghali F, Krishna R, Perkins S, Collins A, Xia D, Wastling J et al (2014) ProteoAnnotator—open source proteogenomics annotation software supporting PSI standards. Proteomics 14(23–24):2731–2741. https://doi.org/10.1002/pmic.201400265

Gonzalez-Gomariz J, Guruceaga E, Lopez-Sanchez M, Segura V (2019) Proteogenomics in the context of the Human Proteome Project (HPP). Expert Rev Proteomics 16(3):267–275. https://doi.org/10.1080/14789450.2019.1571916

Griss J, Perez-Riverol Y, Lewis S, Tabb DL, Dianes JA, Del-Toro N et al (2016) Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat Methods 13(8):651–656. https://doi.org/10.1038/nmeth.3902

Guillot L, Delage L, Viari A, Vandenbrouck Y, Com E, Ritter A et al (2019) Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes. BMC Genomics 20(1):56. https://doi.org/10.1186/s12864-019-5431-9

Guilloy N, Brunet MA, Leblanc S, Jacques JF, Hardy MP, Ehx G et al (2023) OpenCustomDB: integration of unannotated open reading frames and genetic variants to generate more comprehensive customized protein databases. J Proteome Res 22(5):1492–1500. https://doi.org/10.1021/acs.jproteome.3c00054

Has C, Allmer J (2017) PGMiner: Complete proteogenomics workflow; from data acquisition to result visualization. Inf Sci 384:126–134. https://doi.org/10.1016/j.ins.2016.08.005

He C, Jia C, Zhang Y, Xu P (2018) Enrichment-based proteogenomics identifies microproteins, missing proteins, and novel smORFs in Saccharomyces cerevisiae. J Proteome Res 17(7):2335–2344. https://doi.org/10.1021/acs.jproteome.8b00032

Hwang H, Park GW, Park JY, Lee HK, Lee JY, Jeong JE et al (2017) Next generation proteomic pipeline for chromosome-based proteomic research using NeXtProt and GENCODE databases. J Proteome Res 16(12):4425–4434. https://doi.org/10.1021/acs.jproteome.7b00223

Ivanov MV, Lobas AA, Karpov DS, Moshkovskii SA, Gorshkov MV (2017) Comparison of false discovery rate control strategies for variant peptide identifications in shotgun proteogenomics. J Proteome Res 16(5):1936–1943. https://doi.org/10.1021/acs.jproteome.6b01014

Ivanov MV, Lobas AA, Levitsky LI, Moshkovskii SA, Gorshkov MV (2018) Brute-force approach for mass spectrometry-based variant peptide identification in proteogenomics without personalized genomic data. J Am Soc Mass Spectrom 29(2):435–438. https://doi.org/10.1007/s13361-017-1859-9

Jaffe JD, Berg HC, Church GM (2004) Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 4(1):59–77. https://doi.org/10.1002/pmic.200300511

Jagtap P, Goslinga J, Kooren JA, McGowan T, Wroblewski MS, Seymour SL et al (2013) A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics 13(8):1352–1357. https://doi.org/10.1002/pmic.201200352

Jagtap PD, Johnson JE, Onsongo G, Sadler FW, Murray K, Wang Y et al (2014) Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J Proteome Res 13(12):5898–5908. https://doi.org/10.1021/pr500812t

Jeong SK, Kim CY, Paik YK (2018) ASV-ID, a proteogenomic workflow to predict candidate protein isoforms on the basis of transcript evidence. J Proteome Res 17(12):4235–4242. https://doi.org/10.1021/acs.jproteome.8b00548

Jones AR, Siepen JA, Hubbard SJ, Paton NW (2009) Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. Proteomics 9(5):1220–1229. https://doi.org/10.1002/pmic.200800473

Kall L, Storey JD, MacCoss MJ, Noble WS (2008) Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 7(1):29–34. https://doi.org/10.1021/pr700600n

Kelkar S, Kumar D, Kumar P, Balakrishnan L, Muthusamy B, Yadav AK et al (2011) Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics (MCP) 10(12):M111.011627. https://doi.org/10.1074/mcp.M111.011445

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM et al (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006. https://doi.org/10.1101/gr.229102

Khatun J, Yu Y, Wrobel JA, Risk BA, Gunawardena HP, Secrest A et al (2013) Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 14:141. https://doi.org/10.1186/1471-2164-14-141

Kim S, Pevzner PA (2014) MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5:5277. https://doi.org/10.1038/ncomms6277

Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R et al (2014) A draft map of the human proteome. Nature 509(7502):575–581. https://doi.org/10.1038/nature13302

Kim H, Park H, Paek E (2015) NextSearch: a search engine for mass spectrometry data against a compact nucleotide exon graph. J Proteome Res 14(7):2784–2791. https://doi.org/10.1021/acs.jproteome.5b00047

Koch A, Gawron D, Steyaert S, Ndah E, Crappe J, De Keulenaer S et al (2014) A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites. Proteomics 14(23–24):2688–2698. https://doi.org/10.1002/pmic.201400180

Kolmogorov M, Liu X, Pevzner PA (2016) SpectroGene: a tool for proteogenomic annotations using top-down spectra. J Proteome Res 15(1):144–151. https://doi.org/10.1021/acs.jproteome.5b00610

Kou Q, Xun L, Liu X (2016) TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 32(22):3495–3497. https://doi.org/10.1093/bioinformatics/btw398

Kou Q, Wu S, Tolic N, Pasa-Tolic L, Liu Y, Liu X (2017) A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 33(9):1309–1316. https://doi.org/10.1093/bioinformatics/btw806

Kroll JE, da Silva VL, de Souza SJ, de Souza GA (2017) A tool for integrating genetic and mass spectrometry-based peptide data: Proteogenomics Viewer: PV: a genome browser-like tool, which includes MS data visualization and peptide identification parameters. BioEssays. https://doi.org/10.1002/bies.201700015

Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D et al (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19(9):1639–1645. https://doi.org/10.1101/gr.092759.109

Kuhring M, Renard BY (2012) iPiG: integrating peptide spectrum matches into genome browser visualizations. PLoS ONE 7(12):e50246. https://doi.org/10.1371/journal.pone.0050246

Kumar D, Dash D (2016) Proteogenomic tools and approaches to explore protein coding landscapes of eukaryotic genomes. Adv Exp Med Biol 926:1–10. https://doi.org/10.1007/978-3-319-42316-6_1

Kumar D, Yadav AK, Kadimi PK, Nagaraj SH, Grimmond SM, Dash D (2013) Proteogenomic analysis of Bradyrhizobium japonicum USDA110 using GenoSuite, an automated multi-algorithmic pipeline. Mol Cell Proteomics 12(11):3388–3397. https://doi.org/10.1074/mcp.M112.027169

Kumar D, Mondal AK, Yadav AK, Dash D (2014) Discovery of rare protein-coding genes in model methylotroph Methylobacterium extorquens AM1. Proteomics 14(23–24):2790–2794. https://doi.org/10.1002/pmic.201400153

Kumar D, Jain A, Dash D (2015) Probing the missing human proteome: a computational perspective. J Proteome Res 14(12):4949–4958. https://doi.org/10.1021/acs.jproteome.5b00728

Kumar D, Yadav AK, Jia X, Mulvenna J, Dash D (2016) Integrated transcriptomic-proteomic analysis using a proteogenomic workflow refines rat genome annotation. Mol Cell Proteomics (MCP) 15(1):329–339. https://doi.org/10.1074/mcp.M114.047126

Kumar D, Yadav AK, Dash D (2017) Choosing an optimal database for protein identification from tandem mass spectrometry data. Methods Mol Biol 1549:17–29. https://doi.org/10.1007/978-1-4939-6740-7_3

Kwok N, Aretz Z, Takao S, Ser Z, Cifani P, Kentsis A (2023) Integrative proteogenomics using ProteomeGenerator2. J Proteome Res 22(8):2750–2764. https://doi.org/10.1021/acs.jproteome.3c00005

Kwon T, Choi H, Vogel C, Nesvizhskii AI, Marcotte EM (2011) MSblender: A probabilistic approach for integrating peptide identifications from multiple database search engines. J Proteome Res 10(7):2949–2958. https://doi.org/10.1021/pr2002116

Lau E, Han Y, Williams DR, Thomas CT, Shrestha R, Wu JC et al (2019) Splice-junction-based mapping of alternative isoforms in the human proteome. Cell Rep 29(11):3751–65.e5. https://doi.org/10.1016/j.celrep.2019.11.026

Lee SE, Song J, Bosl K, Muller AC, Vitko D, Bennett KL et al (2018) Proteogenomic analysis to identify missing proteins from haploid cell lines. Proteomics 18(8):e1700386. https://doi.org/10.1002/pmic.201700386

Li J, Su Z, Ma ZQ, Slebos RJ, Halvey P, Tabb DL et al (2011) A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics (MCP) 10(5):M110.006536. https://doi.org/10.1074/mcp.M110.006536

Li Y, Wang X, Cho JH, Shaw TI, Wu Z, Bai B et al (2016a) JUMPg: an integrative proteogenomics pipeline identifying unannotated proteins in human brain and cancer cells. J Proteome Res 15(7):2309–2320. https://doi.org/10.1021/acs.jproteome.6b00344

Li H, Joh YS, Kim H, Paek E, Lee SW, Hwang KB (2016b) Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification. BMC Genomics 17(Suppl 13):1031. https://doi.org/10.1186/s12864-016-3327-5

Li H, Park J, Kim H, Hwang KB, Paek E (2017) Systematic comparison of false-discovery-rate-controlling strategies for proteogenomic search using spike-in experiments. J Proteome Res 16(6):2231–2239. https://doi.org/10.1021/acs.jproteome.7b00033

Lu M, Xu L, Jian X, Tan X, Zhao J, Liu Z et al (2022) dbPepNeo2.0: a database for human tumor neoantigen peptides from mass spectrometry and TCR recognition. Front Immunol 13:855976. https://doi.org/10.3389/fimmu.2022.855976

Ma J, Saghatelian A, Shokhirev MN (2018) The influence of transcript assembly on the proteogenomics discovery of microproteins. PLoS ONE 13(3):e0194518. https://doi.org/10.1371/journal.pone.0194518

Mangalaparthi KK, Madugundu AK, Ryan ZC, Garapati K, Peterson JA, Dey G et al (2021) Digging deeper into the immunopeptidome: characterization of post-translationally modified peptides presented by MHC I. J Proteins Proteom 12(3):151–160. https://doi.org/10.1007/s42485-021-00066-x

Mani DR, Krug K, Zhang B, Satpathy S, Clauser KR, Ding L et al (2022) Cancer proteogenomics: current impact and future prospects. Nat Rev Cancer. https://doi.org/10.1038/s41568-022-00446-5

Menschaert G, Fenyo D (2017) Proteogenomics from a bioinformatics angle: a growing field. Mass Spectrom Rev 36(5):584–599. https://doi.org/10.1002/mas.21483

Nesvizhskii AI (2014) Proteogenomics: concepts, applications and computational strategies. Nat Methods 11(11):1114–1125. https://doi.org/10.1038/nmeth.3144

Omasits U, Varadarajan AR, Schmid M, Goetze S, Melidis D, Bourqui M et al (2017) An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics. Genome Res 27(12):2083–2095. https://doi.org/10.1101/gr.218255.116

Pang CN, Tay AP, Aya C, Twine NA, Harkness L, Hart-Smith G et al (2014) Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J Proteome Res 13(1):84–98. https://doi.org/10.1021/pr400820p

Park H, Bae J, Kim H, Kim S, Kim H, Mun DG et al (2014) Compact variant-rich customized sequence database and a fast and sensitive database search for efficient proteogenomic analyses. Proteomics 14(23–24):2742–2749. https://doi.org/10.1002/pmic.201400225

Park GW, Hwang H, Kim KH, Lee JY, Lee HK, Park JY et al (2016) Integrated proteomic pipeline using multiple search engines for a proteogenomic study with a controlled protein false discovery rate. J Proteome Res 15(11):4082–4090. https://doi.org/10.1021/acs.jproteome.6b00376

Park J, Piehowski PD, Wilkins C, Zhou M, Mendoza J, Fujimoto GM et al (2017) Informed-Proteomics: open-source software package for top-down proteomics. Nat Methods 14(9):909–914. https://doi.org/10.1038/nmeth.4388

Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC et al (2018) CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19(1):208. https://doi.org/10.1186/s13059-018-1590-2

Peterson ES, McCue LA, Schrimpe-Rutledge AC, Jensen JL, Walker H, Kobold MA et al (2012) VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC Genomics 13:131. https://doi.org/10.1186/1471-2164-13-131

Raj A, Aggarwal S, Yadav AK, Dash D (2023) Quality control of variant peptides identified through proteogenomics- catching the (un)usual suspects. bioRxiv. https://doi.org/10.1101/2023.05.31.542998

Resing KA, Meyer-Arendt K, Mendoza AM, Aveline-Wolf LD, Jonscher KR, Pierce KG et al (2004) Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem 76(13):3556–3568. https://doi.org/10.1021/ac035229m

Risk BA, Spitzer WJ, Giddings MC (2013) Peppy: proteogenomic search software. J Proteome Res 12(6):3019–3025. https://doi.org/10.1021/pr400208w

Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G et al (2011) Integrative genomics viewer. Nat Biotechnol 29(1):24–26. https://doi.org/10.1038/nbt.1754

Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH et al (2017) Methods, tools and current perspectives in proteogenomics. Mol Cell Proteomics 16(6):959–981. https://doi.org/10.1074/mcp.MR117.000024

Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA et al (2000) Artemis: sequence visualization and annotation. Bioinformatics 16(10):944–945. https://doi.org/10.1093/bioinformatics/16.10.944

Schlaffner N, Pirklbauer GJ, Bender A, Choudhary JS (2017) Fast, quantitative and variant enabled mapping of peptides to genomes. Cell Syst 5(2):152–6.e4. https://doi.org/10.1016/j.cels.2017.07.007

Searle BC, Turner M, Nesvizhskii AI (2008) Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J Proteome Res 7(1):245–253. https://doi.org/10.1021/pr070540w

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311. https://doi.org/10.1093/nar/29.1.308

Sheynkman GM, Johnson JE, Jagtap PD, Shortreed MR, Onsongo G, Frey BL et al (2014) Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics 15:703. https://doi.org/10.1186/1471-2164-15-703

Shilov IV, Seymour SL, Patel AA, Loboda A, Tang WH, Keating SP et al (2007) The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol Cell Proteomics (MCP) 6(9):1638–1655. https://doi.org/10.1074/mcp.T600050-MCP200

Shteynberg D, Deutsch EW, Lam H, Eng JK, Sun Z, Tasman N et al (2011) iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteomics (MCP) 10(12):M111.007690. https://doi.org/10.1074/mcp.M111.007690

Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F (2021) Methods for proteogenomics data analysis, challenges, and scalability bottlenecks: a survey. IEEE Access 9:5497–5516. https://doi.org/10.1109/ACCESS.2020.3047588

Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N et al (2019) COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res 47(D1):D941–D947. https://doi.org/10.1093/nar/gky1015

Tavares R, de Miranda SN, Pauletti BA, Araujo E, Folador EL, Espindola G et al (2014) SpliceProt: a protein sequence repository of predicted human splice variants. Proteomics 14(2–3):181–185. https://doi.org/10.1002/pmic.201300078

The Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA et al (2013) The cancer genome atlas pan-cancer analysis project. Nat Genet 45(10):1113–1120. https://doi.org/10.1038/ng.2764

Tolani P, Gupta S, Yadav K, Aggarwal S, Yadav AK (2021) Chapter four—Big data, integrative omics and network biology. In: Donev R, Karabencheva-Christova T (eds) Advances in protein chemistry and structural biology. Academic Press, pp 127–160

UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47(D1):D506–D515. https://doi.org/10.1093/nar/gky1049

Van Damme P, Gawron D, Van Criekinge W, Menschaert G (2014) N-terminal proteomics and ribosome profiling provide a comprehensive view of the alternative translation initiation landscape in mice and men. Mol Cell Proteomics (MCP) 13(5):1245–1261. https://doi.org/10.1074/mcp.M113.036442

van de Geer WS, van Riet J, van de Werken HJG (2022) ProteoDisco: a flexible R approach to generate customized protein databases for extended search space of novel and variant proteins in proteogenomic studies. Bioinformatics 38(5):1437–1439. https://doi.org/10.1093/bioinformatics/btab809

Verbruggen S, Ndah E, Van Criekinge W, Gessulat S, Kuster B, Wilhelm M et al (2019) PROTEOFORMER 2.0: further developments in the ribosome profiling-assisted proteogenomic hunt for new proteoforms. Mol Cell Proteomics (MCP). https://doi.org/10.1074/mcp.RA118.001218

Wang X, Zhang B (2013) customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 29(24):3235–3237. https://doi.org/10.1093/bioinformatics/btt543

Wang X, Slebos RJ, Chambers MC, Tabb DL, Liebler DC, Zhang B (2016) proBAMsuite, a bioinformatics framework for genome-based representation and analysis of proteomics data. Mol Cell Proteomics (MCP) 15(3):1164–1175. https://doi.org/10.1074/mcp.M115.052860

Wang LB, Karpova A, Gritsenko MA, Kyle JE, Cao S, Li Y et al (2021) Proteogenomic and metabolomic characterization of human glioblastoma. Cancer Cell 39(4):509–28.e20. https://doi.org/10.1016/j.ccell.2021.01.006

Woo S, Cha SW, Merrihew G, He Y, Castellana N, Guest C et al (2014a) Proteogenomic database construction driven from large scale RNA-seq data. J Proteome Res 13(1):21–28. https://doi.org/10.1021/pr400294c

Woo S, Cha SW, Na S, Guest C, Liu T, Smith RD et al (2014b) Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics 14(23–24):2719–2730. https://doi.org/10.1002/pmic.201400206

Yadav AK, Kumar D, Dash D (2011a) MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res 10(5):2154–2160. https://doi.org/10.1021/pr200031z

Yadav AK, Bhardwaj G, Basak T, Kumar D, Ahmad S, Priyadarshini R et al (2011b) A systematic analysis of eluted fraction of plasma post immunoaffinity depletion: implications in biomarker discovery. PLoS ONE 6(9):e24442. https://doi.org/10.1371/journal.pone.0024442

Yang R, Zhu D, Kou Q, Bhat-Nakshatri P, Nakshatri H, Wu S et al (2017) A spectrum graph-based protein sequence filtering algorithm for proteoform identification by top-down mass spectrometry. In: Proceedings (IEEE Int Conf Bioinformatics Biomed), pp 222–229. https://doi.org/10.1109/BIBM.2017.8217653

Yates JR 3rd, Eng JK, McCormack AL (1995) Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal Chem 67(18):3202–3210

Yeom J, Kabir MH, Lim B, Ahn HS, Kim SY, Lee C (2016) A proteogenomic approach for protein-level evidence of genomic variants in cancer cells. Sci Rep 6:35305. https://doi.org/10.1038/srep35305

Zahn-Zabal M, Michel PA, Gateau A, Nikitin F, Schaeffer M, Audot E et al (2020) The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res 48(D1):D328–D334. https://doi.org/10.1093/nar/gkz995

Zamdborg L, LeDuc RD, Glowacz KJ, Kim YB, Viswanathan V, Spaulding IT et al (2007) ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Res 35(1):W701–W706. https://doi.org/10.1093/nar/gkm371

Zhang K, Fu Y, Zeng WF, He K, Chi H, Liu C et al (2015) A note on the false discovery rate of novel peptides in proteogenomics. Bioinformatics 31(20):3249–3253. https://doi.org/10.1093/bioinformatics/btv340

Zhang H, Liu T, Zhang Z, Payne SH, Zhang B, McDermott JE et al (2016) Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell 166(3):755–765. https://doi.org/10.1016/j.cell.2016.05.069

Zhang M, Wang B, Xu J, Wang X, Xie L, Zhang B et al (2017) CanProVar 2.0: an updated database of human cancer proteome variation. J Proteome Res 16(2):421–432. https://doi.org/10.1021/acs.jproteome.6b00505

Zhang H, Bai L, Wu XQ, Tian X, Feng J, Wu X et al (2023) Proteogenomics of clear cell renal cell carcinoma response to tyrosine kinase inhibitor. Nat Commun 14(1):4274. https://doi.org/10.1038/s41467-023-39981-6

Zhu Y, Hultin-Rosenberg L, Forshed J, Branca RM, Orre LM, Lehtio J (2014) SpliceVista, a tool for splice variant identification and visualization in shotgun proteomics data. Mol Cell Proteomics (MCP) 13(6):1552–1562. https://doi.org/10.1074/mcp.M113.031203

Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M et al (2018) Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 9(1):903. https://doi.org/10.1038/s41467-018-03311-y

Zickmann F, Renard BY (2015) MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms. Bioinformatics 31(12):i106–i115. https://doi.org/10.1093/bioinformatics/btv236