Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm

Michael A. Skinnider1,2, Chris A. Dejong1,2, Brian C. Franczak3,4, Paul D. McNicholas4, Nathan A. Magarvey1,2
1Department of Chemistry and Chemical Biology, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Canada
2Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Canada
3Department of Mathematics and Statistics, MacEwan University, Edmonton, Canada
4Department of Mathematics and Statistics, McMaster University, Hamilton, Canada

Tóm tắt

Natural products represent a prominent source of pharmaceutically and industrially important agents. Calculating the chemical similarity of two molecules is a central task in cheminformatics, with applications at multiple stages of the drug discovery pipeline. Quantifying the similarity of natural products is a particularly important problem, as the biological activities of these molecules have been extensively optimized by natural selection. The large and structurally complex scaffolds of natural products distinguish their physical and chemical properties from those of synthetic compounds. However, no analysis of the performance of existing methods for molecular similarity calculation specific to natural products has been reported to date. Here, we present LEMONS, an algorithm for the enumeration of hypothetical modular natural product structures. We leverage this algorithm to conduct a comparative analysis of molecular similarity methods within the unique chemical space occupied by modular natural products using controlled synthetic data, and comprehensively investigate the impact of diverse biosynthetic parameters on similarity search. We additionally investigate a recently described algorithm for natural product retrobiosynthesis and alignment, and find that when rule-based retrobiosynthesis can be applied, this approach outperforms conventional two-dimensional fingerprints, suggesting it may represent a valuable approach for the targeted exploration of natural product chemical space and microbial genome mining. Our open-source algorithm is an extensible method of enumerating hypothetical natural product structures with diverse potential applications in bioinformatics.

Từ khóa


Tài liệu tham khảo

Bender A, Glen RC (2004) Molecular similarity: a key technique in molecular informatics. Org Biomol Chem 2(22):3204–3218

Maggiora G, Vogt M, Stumpfe D, Bajorath J (2014) Molecular similarity in medicinal chemistry. J Med Chem 57(8):3186–3204

Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49(1):108–119

Cereto-Massague A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallve S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63

Bajusz D, Racz A, Heberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:20

Irwin JJ (2008) Community benchmarks for virtual screening. J Comput Aided Mol Des 22(3–4):193–199

Rohrer SG, Baumann K (2009) Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J Chem Inf Model 49(2):169–184

Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40(Database issue):D1100–D1107

Tiikkainen P, Markt P, Wolber G, Kirchmair J, Distinto S, Poso A et al (2009) Critical comparison of virtual screening methods against the MUV data set. J Chem Inf Model 49(10):2168–2178

Venkatraman V, Perez-Nueno VI, Mavridis L, Ritchie DW (2010) Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J Chem Inf Model 50(12):2079–2093

Hu G, Kuang G, Xiao W, Li W, Liu G, Tang Y (2012) Performance evaluation of 2D fingerprint and 3D shape similarity methods in virtual screening. J Chem Inf Model 52(5):1103–1113

Riniker S, Landrum GA (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform 5(1):26

Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42(6):1407–1414

Holliday JD, Hu CY, Willett P (2002) Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screen 5(2):155–166

Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52(11):2884–2901

Salim N, Holliday J, Willett P (2003) Combination of fingerprint-based similarity coefficients using data fusion. J Chem Inf Comput Sci 43(2):435–442

Whittle M, Gillet VJ, Willett P, Alex A, Loesel J (2004) Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: a comparison of similarity coefficients. J Chem Inf Comput Sci 44(5):1840–1848

Willett P (2013) Combination of similarity rankings using data fusion. J Chem Inf Model 53(1):1–10

Newman DJ, Cragg GM (2012) Natural products as sources of new drugs over the 30 years from 1981 to 2010. J Nat Prod 75(3):311–335

Clardy J, Walsh C (2004) Lessons from natural molecules. Nature 432(7019):829–837

Henkel T, Brunne RM, Muller H, Reichel F (1999) Statistical investigation into the structural complementarity of natural products and synthetic compounds. Angew Chem Int Ed Engl 38(5):643–647

Feher M, Schmidt JM (2003) Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. J Chem Inf Comput Sci 43(1):218–227

Lee ML, Schneider G (2001) Scaffold architecture and pharmacophoric properties of natural products and trade drugs: application in the design of natural product-based combinatorial libraries. J Comb Chem 3(3):284–289

Hert J, Irwin JJ, Laggner C, Keiser MJ, Shoichet BK (2009) Quantifying biogenic bias in screening libraries. Nat Chem Biol 5(7):479–483

Eberhardt L, Kumar K, Waldmann H (2011) Exploring and Exploiting Biologically Relevant Chemical Space. Curr Drug Targets 12(11):1531–1546

Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliver Rev 23(1–3):3–25

Ganesan A (2008) The impact of natural products upon modern drug discovery. Curr Opin Chem Biol 12(3):306–317

Koch MA, Schuffenhauer A, Scheck M, Wetzel S, Casaulta M, Odermatt A et al (2005) Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proc Natl Acad Sci USA 102(48):17272–17277

Larsson J, Gottfries J, Muresan S, Backlund A (2007) ChemGPS-NP: tuned for navigation in biologically relevant chemical space. J Nat Prod 70(5):789–794

Ertl P, Roggo S, Schuffenhauer A (2008) Natural product-likeness score and its application for prioritization of compound libraries. J Chem Inf Model 48(1):68–74

Rosen J, Gottfries J, Muresan S, Backlund A, Oprea TI (2009) Novel chemical space exploration via natural products. J Med Chem 52(7):1953–1962

Over B, Wetzel S, Grutter C, Nakai Y, Renner S, Rauh D et al (2013) Natural-product-derived fragments for fragment-based ligand discovery. Nat Chem 5(1):21–28

Johnston CW, Skinnider MA, Dejong CA, Rees PN, Chen GM, Walker CG, French S, Brown ED, Bérdy J, Liu DY, Magarvey NA (2016) Assembly and clustering of natural antibiotics guides target identification. Nat Chem Biol 12(4):233–239

Riniker S, Landrum GA (2013) Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. J Cheminform 5(1):43

Vidal D, Thormann M, Pons M (2005) LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J Chem Inf Model 45(2):386–393

Grant JA, Haigh JA, Pickup BT, Nicholls A, Sayle RA (2006) Lingos, finite state machines, and fast similarity searching. J Chem Inf Model 46(5):1912–1918

Haque IS, Pande VS, Walters WP (2010) SIML: a fast SIMD algorithm for calculating LINGO chemical similarities on GPUs and CPUs. J Chem Inf Model 50(4):560–564

Walsh CT, O’Brien RV, Khosla C (2013) Nonproteinogenic amino acid building blocks for nonribosomal peptide and hybrid polyketide scaffolds. Angew Chem Intl Ed Engl 52(28):7098–7124

Skinnider MA, Dejong CA, Rees PN, Johnston CW, Li H, Webster AL et al (2015) Genomes to natural products PRediction Informatics for Secondary Metabolomes (PRISM). Nucleic Acids Res 43(20):9645–9662

Moore BS, Hertweck C (2002) Biosynthesis and attachment of novel bacterial polyketide synthase starter units. Nat Prod Rep 19(1):70–99

Duan J, Dixon SL, Lowrie JF, Sherman W (2010) Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods. J Mol Graph Model 29(2):157–170

Walsh CT (2015) A chemocentric view of the natural product inventory. Nature Chem Biol 11(9):620–624

Zhang Q, Ortega M, Shi Y, Wang H, Melby JO, Tang W et al (2014) Structural investigation of ribosomally synthesized natural products by hypothetical structure enumeration and evaluation using tandem MS. Proc Natl Acad Sci USA 111(33):12031–12036

Johnston CW, Skinnider MA, Wyatt MA, Li X, Ranieri MR, Yang L et al (2015) An automated Genomes-to-Natural Products platform (GNP) for the discovery of modular natural products. Nat Commun 6:8421

Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL (2006) Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12(17):2111–2120

Vaillancourt FH, Yeh E, Vosburg DA, Garneau-Tsodikova S, Walsh CT (2006) Nature’s inventory of halogenation catalysts: oxidative strategies predominate. Chem Rev 106(8):3364–3378

Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280

Bolton EE, Wang Y, Thiessen PA, Bryant SH (2008) PubChem: integrated platform of small molecules and biological activities. Annu Rep Comput Chem 31(4):217–241

Hall LH, Kier LB (1995) Electrotopological state indexes for atom types - a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci 35(6):1039–1045

Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525

Daylight Toolkit. Daylight Chemical Information Systems. Inc.: Aliso Viejo, CA. 2007

Landrum G. RDKit: Open-source cheminformatics. http://www.rdkit.org

Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754