The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences

Nucleic Acids Research - Volume 50, Issue D1 - Pages D543-D552 - 2022
Yasset Pérez‐Riverol [1], Jinwen Bai [1], Chakradhar Bandla [1], David García‐Seisdedos [1], Suresh Hewapathirana [1], Selvakumar Kamatchinathan [1], Deepti Jaiswal [1], Ananth Prakash [1], Anika Frericks-Zipper [2,3], Martin Eisenacher [2,3], Mathias Walzer [1], Shengbo Wang [1], Alvis Brāzma [1], Juan Antonio Vizcaíno [1]
[1] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
[2] Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany
[3] Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany

Abstract

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. Submissions to PRIDE Archive (the archival component of PRIDE) averaged around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the MAGE-TAB file format for proteomics has been developed to improve sample metadata annotation. Additionally, the PRIDE Peptidome resource provides access to aggregated peptide/protein evidence across PRIDE Archive. Furthermore, we describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.
