IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra

Springer Science and Business Media LLC - Volume 16 - Pages 1-8 - 2024
Sadjad Fakouri Baygi¹, Dinesh Kumar Barupal¹
¹Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, USA

Abstract

The majority of tandem mass spectrometry (MS/MS) spectra in untargeted metabolomics and exposomics studies lack any annotation. Our deep learning framework, Integrated Data Science Laboratory for Metabolomics and Exposomics - Mass INTerpreter (IDSL_MINT), translates MS/MS spectra into molecular fingerprint descriptors. IDSL_MINT allows users to apply the transformer model to mass spectrometry data, much as large language models apply it to text. Models are trained on user-provided reference MS/MS libraries using any customizable molecular fingerprint descriptor. IDSL_MINT was benchmarked using the LipidMaps database, and it improved the annotation rate of a test study for MS/MS spectra that existing mass spectral libraries had left unannotated. IDSL_MINT may improve overall annotation rates in untargeted metabolomics and exposomics studies. The IDSL_MINT framework and tutorials are available in the GitHub repository at https://github.com/idslme/IDSL_MINT.

Scientific contribution statement. Structural annotation of MS/MS spectra from untargeted metabolomics and exposomics datasets is a major bottleneck in gaining new biological insights. Machine learning models that convert spectra into molecular fingerprints can help in the annotation process. Here, we present IDSL_MINT, a new, easy-to-use, and customizable deep learning framework for training and using models that predict molecular fingerprints from spectra in compound annotation workflows.
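To make concrete how a predicted fingerprint can support annotation, the sketch below ranks candidate structures by Tanimoto similarity between a predicted fingerprint and library fingerprints. This is an illustration under stated assumptions, not the IDSL_MINT implementation: the abstract does not say how predicted fingerprints are matched to candidates, and the function names, bit positions, and candidate names here are all hypothetical. Fingerprints are represented simply as sets of "on" bit indices.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_candidates(
    predicted: Set[int], library: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Score every library entry against the predicted bits, best match first."""
    scored = [(name, tanimoto(predicted, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprint predicted from one MS/MS spectrum (assumed data).
predicted_bits = {1, 4, 7, 9}

# Hypothetical candidate fingerprints (assumed data, not from LipidMaps).
candidate_library = {
    "candidate_A": {1, 4, 7, 9, 12},
    "candidate_B": {2, 3, 5},
    "candidate_C": {1, 4, 8},
}

ranking = rank_candidates(predicted_bits, candidate_library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy case candidate_A shares four of five total bits with the prediction and ranks first. In practice the fingerprint bits would come from a cheminformatics toolkit and the predicted bits from a trained model.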
