IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra
Abstract
The majority of tandem mass spectrometry (MS/MS) spectra in untargeted metabolomics and exposomics studies lack any annotation. Our deep learning framework, Integrated Data Science Laboratory for Metabolomics and Exposomics—Mass INTerpreter (IDSL_MINT), translates MS/MS spectra into molecular fingerprint descriptors. IDSL_MINT lets users apply the transformer model, the architecture behind large language models, to mass spectrometry data. Models are trained on user-provided reference MS/MS libraries using any customizable molecular fingerprint descriptor. IDSL_MINT was benchmarked using the LipidMaps database, and in a test study it improved the annotation rate for MS/MS spectra that had not been annotated using existing mass spectral libraries. IDSL_MINT may improve overall annotation rates in untargeted metabolomics and exposomics studies. The IDSL_MINT framework and tutorials are available in the GitHub repository at https://github.com/idslme/IDSL_MINT.
Scientific contribution statement
Structural annotation of MS/MS spectra from untargeted metabolomics and exposomics datasets is a major bottleneck in gaining new biological insights. Machine learning models that convert spectra into molecular fingerprints can help in the annotation process. Here, we present IDSL_MINT, a new, easy-to-use, and customizable deep learning framework for training and using models that predict molecular fingerprints from spectra in compound annotation workflows.
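As a rough illustration of how a predicted fingerprint might feed into annotation, the sketch below ranks candidate structures by Tanimoto similarity to a predicted fingerprint. It does not use the IDSL_MINT API. The similarity measure, the function names, and the toy bit sets are all assumptions made for this example, and fingerprints are represented simply as sets of "on" bit indices.

```python
# Minimal sketch (not the IDSL_MINT API): rank candidate structures by how
# closely their fingerprints match a fingerprint predicted from a spectrum.
# Fingerprints are modeled as sets of "on" bit indices; all names and values
# below are hypothetical.

def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints share nothing; define their similarity as 0.0.
    return len(a & b) / union if union else 0.0


def rank_candidates(predicted_bits, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, tanimoto(predicted_bits, bits))
        for name, bits in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical predicted fingerprint for one MS/MS spectrum.
predicted = {1, 5, 9, 12}

# Hypothetical candidate fingerprints, e.g. computed from a structure database.
candidates = {
    "candidate_A": {1, 5, 9, 12, 20},
    "candidate_B": {2, 5, 30},
    "candidate_C": {7, 8},
}

ranking = rank_candidates(predicted, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the candidate fingerprints would have to be computed with the same descriptor that the model was trained to predict; otherwise the bit positions are not comparable.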