Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design

Cell Systems - Volume 11 - Pages 461-477.e9 - 2020
Brian Hie¹, Bryan D. Bryson², Bonnie Berger¹,³
¹ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
² Ragon Institute of Massachusetts General Hospital, MIT, and Harvard, Cambridge, MA 02139, USA
³ Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
