Practical guidelines for the use of gradient boosting for molecular property prediction

Davide Boldini1, Francesca Grisoni2, Daniel Kühn3, Lukas Friedrich3, Stephan A. Sieber1
1Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany
2Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
3Merck Healthcare KGaA, Darmstadt, Germany

Abstract

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently attracted particular attention for its performance in data science competitions, virtual screening campaigns and bioactivity prediction. However, several variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, evaluated on 16 datasets and 94 endpoints comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in their regularization techniques and decision tree structures. Expert knowledge must therefore always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets, and that optimizing as many hyperparameters as possible is crucial to maximize predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
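One way to check in practice whether two gradient boosting models rank molecular features differently is to compute the Spearman rank correlation between their feature-importance scores. The following is a minimal sketch in plain Python, not the procedure used in the study. The descriptor names and importance values are hypothetical placeholders chosen only for illustration; in a real workflow they would come from the trained models.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank features by descending importance (1 = most important).

    Tied scores share the average of the ranks they span.
    """
    ordered = sorted(scores, key=lambda name: -scores[name])
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of features tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    names = sorted(a)
    if names != sorted(b):
        raise ValueError("both models must score the same set of features")
    rank_a, rank_b = ranks(a), ranks(b)
    xs = [rank_a[n] for n in names]
    ys = [rank_b[n] for n in names]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant rankings")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical importance scores for five made-up descriptors (assumption,
# not data from the study).
model_a = {"logp": 0.40, "tpsa": 0.25, "mol_wt": 0.20, "hbd": 0.10, "rot_bonds": 0.05}
model_b = {"logp": 0.10, "tpsa": 0.35, "mol_wt": 0.30, "hbd": 0.05, "rot_bonds": 0.20}

rho = spearman(model_a, model_b)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 indicates that the two models agree on which features matter most. A value near 0, as in this toy case, signals that the rankings disagree and that the explanations deserve expert review before any chemical interpretation is drawn from them.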
