Combatting over-specialization bias in growing chemical databases
Tóm tắt
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space. In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under
github.com/KatDost/Cancels
.
Tài liệu tham khảo
Caliskan A, Bryson JJ, Narayanan A (2017) Semantics derived automatically from language corpora contain human-like biases. Science 356(6334):183–186. https://doi.org/10.1126/science.aal4230
Sieg J, Flachsenberg F, Rarey M (2019) In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inf Model 59(3):947–961. https://doi.org/10.1021/acs.jcim.8b00712
Hert J, Irwin JJ, Laggner C, Keiser MJ, Shoichet BK (2009) Quantifying biogenic bias in screening libraries. Nat Chem Biol 5(7):479–483. https://doi.org/10.1038/nchembio.180
Kerstjens A, De Winter H (2022) LEADD: lamarckian evolutionary algorithm for de novo drug design. J Cheminform 14(1):1–20. https://doi.org/10.1186/s13321-022-00582-y
Gregori-Puigjané E, Mestres J (2008) Coverage and bias in chemical library design. Curr Opin Chem Biol 12(3):359–365. https://doi.org/10.1016/j.cbpa.2008.03.015
Aniceto N, Freitas AA, Bender A, Ghafourian T (2016) A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: reliability-density neighbourhood. J Cheminform 8(1):1–20. https://doi.org/10.1186/s13321-016-0182-y
Sahigara F, Ballabio D, Todeschini R, Consonni V (2013) Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J Cheminform 5:27. https://doi.org/10.1186/1758-2946-5-27
Cleves AE, Jain AN (2008) Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J Comput Aided Mol Des 22(3–4):147–159. https://doi.org/10.1007/s10822-007-9150-y
Jia X, Lynch A, Huang Y, Danielson M, Lang’at I, Milder A, Ruby AE, Wang H, Friedler SA, Norquist AJ, Schrier J (2019) Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573(7773):251–255. https://doi.org/10.1038/s41586-019-1540-5
Settles B (2012) Active learning. Synth Lect Artif Intell Mach Learn 6(1):1–114. https://doi.org/10.2200/S00429ED1V01Y201207AIM018
Ovadia Y, Fertig E, Ren J, Nado Z, Sculley D, Nowozin S, Dillon JV, Lakshminarayanan B, Snoek J (2019) Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. Curran Associates Inc., Red Hook, NY, USA
Dost K, Taskova K, Riddle P, Wicker J (2020) Your best guess when you know nothing: identification and mitigation of selection bias. In: 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, IEEE, New York, pp 996–1001. https://doi.org/10.1109/ICDM50108.2020.00115
Dost K, Duncanson H, Ziogas I, Riddle P, Wicker J (2022) Divide and imitate: Multi-cluster identification and mitigation of selection bias. In: Advances in Knowledge Discovery and Data Mining—26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol 13281, Springer, Cham, pp 149–160. https://doi.org/10.1007/978-3-031-05936-0_12
Mouchlis VD, Afantitis A, Serra A, Fratello M, Papadiamantis AG, Aidinis V, Lynch I, Greco D, Melagraki G (2021) Advances in de novo drug design: from conventional to machine learning methods. Int J Mol Sci 22(4):1–22. https://doi.org/10.3390/ijms22041676
Schneider G, Clark DE (2019) Automated de novo drug design: are we nearly there yet? Angew Chem Int Ed 58(32):10792–10803. https://doi.org/10.1002/anie.201814681
Kwon Y, Lee J (2021) MolFinder: an evolutionary algorithm for the global optimization of molecular properties and the extensive exploration of chemical space using SMILES. J Cheminform 13(1):1–14. https://doi.org/10.1186/s13321-021-00501-7
Schneider P, Schneider G (2016) De novo design at the edge of chaos. J Med Chem 59(9):4077–4086. https://doi.org/10.1021/acs.jmedchem.5b01849. (PMID: 26881908)
Arús-Pous J, Blaschke T, Ulander S, Reymond JL, Chen H, Engkvist O (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11(1):1–14. https://doi.org/10.1186/s13321-019-0341-z
Kang SG, Morrone JA, Weber JK, Cornell WD (2022) Analysis of training and seed bias in small molecules generated with a conditional graph-based variational autoencoder—insights for practical AI-driven molecule generation. J Chem Inf Model 62(4):801–816. https://doi.org/10.1021/acs.jcim.1c01545
Pereira T, Abbasi M, Ribeiro B, Arrais JP (2021) Diversity oriented deep reinforcement learning for targeted molecule generation. J Cheminform 13(1):1–17. https://doi.org/10.1186/s13321-021-00498-z
Bareinboim E, Tian J, Pearl J (2014) Recovering from selection bias in causal and statistical inference. Proc AAAI Conf Artif Intell. 28(1):9074
Lyon A (2014) Why are normal distributions normal? Br J Philos Sci 65(3):621–649. https://doi.org/10.1093/bjps/axs046
Hoeffding W, Robbins H (1948) The central limit theorem for dependent random variables. Duke Math J 15(3):773–780. https://doi.org/10.1215/S0012-7094-48-01568-3
Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13(4):411–430. https://doi.org/10.1016/S0893-6080(00)00026-5
Panigrahi S, Nanda A, Swarnkar T (2021) A survey on transfer learning. Smart Innov Syst Technol 194(10):781–789. https://doi.org/10.1007/978-981-15-5971-6_83
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2013) Covariate shift by Kernel mean matching. Dataset Shift Mach Learn. https://doi.org/10.7551/mitpress/9780262170055.003.0008
McGaughey G, Walters W, Goldman B (2016) Understanding covariate shift in model performance. F1000Research. https://doi.org/10.12688/f1000research.8317.1
Bickel S, Brückner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. In: Proceedings of the 24th International Conference on Machine Learning. ICML ’07, Association for Computing Machinery, New York, NY, USA, pp 81–88. https://doi.org/10.1145/1273496.1273507
Cortes C, Mohri M, Riley M, Rostamizadeh A (2008) Sample selection bias correction theory. In: Proceedings of the 19th International Conference on Algorithmic Learning Theory. ALT ’08Springer, Berlin, Heidelberg, pp 38–53. https://doi.org/10.1007/978-3-540-87987-9_8
Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML ’04, Association for Computing Machinery, New York, NY, USA, p 114. https://doi.org/10.1145/1015330.1015425
Huang J, Smola A.J, Gretton A, Borgwardt KM, Schölkopf B (2007) Correcting sample selection bias by unlabeled data. In: Advances in Neural Information Processing Systems, pp 601–608. https://doi.org/10.7551/mitpress/7503.003.0080
Lin Y, Lee Y, Wahba G (2002) Support vector machines for classification in nonstandard situations. Mach Learn 46(1–3):191–202. https://doi.org/10.1023/A:1012406528296
Sugiyama M, Müller K-R (2005) Input-dependent estimation of generalization error under covariate shift 23(4):249–279. https://doi.org/10.1524/stnd.2005.23.4.249
Baum EB, Lang K (1992) Query learning can work poorly when a human oracle is used. In: International Joint Conference on Neural Networks, vol 8, p 8
Smith JS, Nebgen B, Lubbers N, Isayev O, Roitberg AE (2018) Less is more: sampling chemical space with active learning. J Chem Phys 148(24):241733. https://doi.org/10.1063/1.5023802
Reker D, Schneider G (2015) Active-learning strategies in computer-assisted drug discovery. Drug Discov Today 20(4):458–465. https://doi.org/10.1016/j.drudis.2014.12.004
Habib Polash A, Nakano T, Rakers C, Takeda S, Brown JB (2020) Active learning efficiently converges on rational limits of toxicity prediction and identifies patterns for molecule design. Comput Toxicol 15:100129. https://doi.org/10.1016/j.comtox.2020.100129
Reker D, Schneider P, Schneider G, Brown J (2017) Active learning for computational chemogenomics. Future Med Chem 9(4):381–402. https://doi.org/10.4155/fmc-2016-0197
Zhong S, Lambeth DR, Igou TK, Chen Y (2022) Enlarging applicability domain of quantitative structure-activity relationship models through uncertainty-based active learning. ACS ES &T Eng 2(7):1211–1220. https://doi.org/10.1021/acsestengg.1c00434
Sugiyama M, Rubens N (2008) A batch ensemble approach to active learning with model selection. Neural Netw 21(9):1278–1286.
Bohacek RS, McMartin C, Guida WC (1996) The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev 16(1):3–50
Idakwo G, Thangapandian S, Luttrell J, Li Y, Wang N, Zhou Z, Hong H, Yang B, Zhang C, Gong P (2020) Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 12(1):1–19. https://doi.org/10.1186/s13321-020-00468-x
Stepišnik T, Škrlj B, Wicker J, Kocev D, (2021) A comprehensive comparison of molecular feature representations for use in predictive modeling. Comput Biol Med 130:104197. https://doi.org/10.1016/j.compbiomed.2020.104197
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE (2020) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49(D1):1388–1395. https://doi.org/10.1093/nar/gkaa971
Kuwahara H, Gao X (2021) Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. J Cheminform 13(1):1–12. https://doi.org/10.1186/s13321-021-00506-2
Martin E, Cao E (2015) Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens. J Comput Aided Mol Des 29:387–395. https://doi.org/10.1007/s10822-014-9819-y
Mead A (1992) Review of the development of multidimensional scaling methods. J R Stat Soc Series D 41(1):27–39
Granichin O, Volkovich Z, Toledano-Kitai D (2015) Randomized algorithms in automatic control and data mining vol 67. https://doi.org/10.1007/978-3-642-54786-7
Dost K (2022) CANCELS experiments and implementation. https://github.com/KatDost/Cancels. Accessed 21 Sep 2022
Latino D, Wicker J, Gütlein M, Schmid E, Kramer S, Fenner K (2017) Eawag-soil in envipath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data. Enviro Sci Process Impact. https://doi.org/10.1039/C6EM00697C
Wicker J, Fenner K, Ellis L, Wackett L, Kramer S (2010) Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach. Bioinformatics 26(6):814–821. https://doi.org/10.1093/bioinformatics/btq024
Wicker J, Fenner K, Kramer S (2016) A hybrid machine learning and knowledge based approach to limit combinatorial explosion in biodegradation prediction. In: Lässig J, Kersting K, Morik K (eds) Comput Sustain. Springer, Cham, pp 75–97
Wicker J, Lorsbach T, Gütlein M, Schmid E, Latino D, Kramer S, Fenner K (2016) Envipath - the environmental contaminant biotransformation pathway resource. Nucleic Acid Res 44(D1):502–508. https://doi.org/10.1093/nar/gkv1229
Tam J, Lorsbach T, Schmidt S, Wicker J (2021) Holistic evaluation of biodegradation pathway prediction: assessing multi-step reactions and intermediate products. J Cheminform 13(1):63. https://doi.org/10.1186/s13321-021-00543-x
Mayr A, Klambauer G, Unterthiner T, Hochreiter S (2016) Deeptox: toxicity prediction using deep learning. Front Environ Sci. https://doi.org/10.3389/fenvs.2015.00080
Huang R, Xia M, Nguyen D-T, Zhao T, Sakamuru S, Zhao J, Shahane SA, Rossoshek A, Simeonov A (2016) Tox21challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Front Environ Sci. https://doi.org/10.3389/fenvs.2015.00085
Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333–359
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Herbold, S (2020) Autorank: A python package for automated ranking of classifiers. J Open Source Softw 5(48), 2173. https://doi.org/10.21105/joss.02173
Winter R, Montanari F, Noé F, Clevert D-A (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10:1692–1701. https://doi.org/10.1039/C8SC04175J
Yap CW (2011) Padel-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474. https://doi.org/10.1002/jcc.21707
Gladysz R, Dos Santos F.M, Langenaeker W, Thijs G, Augustyns K, De Winter H (2018) Spectrophores as one-dimensional descriptors calculated from three-dimensional atomic properties: applications ranging from scaffold hopping to multi-target virtual screening. Journal of Cheminformatics 10(1). https://doi.org/10.1186/s13321-018-0268-9
Jaeger S, Fulle S, Turk S (2018) Mol2vec: Unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58(1):27–35. https://doi.org/10.1021/acs.jcim.7b00616
enviPath UG & Co. KG: SOIL dataset. https://envipath.org/package/5882df9c-dae1-4d80-a40e-db4724271456. Accessed 21 Sep 2022
enviPath UG & Co. KG: BBD dataset. https://envipath.org/package/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1. Accessed 21 Sep 2022
enviPath UG & Co. KG: enviPath. https://envipath.org. Accessed 21 Sep 2022
National Center for Biotechnology Information: PubChem. https://pubchem.ncbi.nlm.nih.gov. Accessed 21 Sep 2022
National Center for Advancing Translational Sciences: Tox21 Data Challenge. https://tripod.nih.gov/tox21/challenge. Accessed 21 Sep 2022
Dost K, Brydon L (2022) PyPI Package “imitatebias”. https://pypi.org/project/imitatebias Accessed 21 Sep 2022
