From complex data to biological insight: ‘DEKER’ feature selection and network inference

Journal of Pharmacokinetics and Pharmacodynamics - Volume 49 - Pages 81-99 - 2021
Sean M. S. Hayes¹, Jeffrey R. Sachs¹, Carolyn R. Cho¹
¹Quantitative Pharmacology and Pharmacometrics, Merck & Co., Inc., Kenilworth, USA

Abstract

Network inference is a valuable approach for gaining mechanistic insight from high-dimensional biological data. Existing network inference methods focus on ranking all possible relations (edges) among all measured quantities (features), such as genes, proteins, and metabolites. The result is a dense network that is challenging to interpret. Identifying a sparse, interpretable network with these methods therefore requires an error-prone thresholding step that compromises their performance. In this article we propose a new method, DEKER-NET, that addresses this limitation by directly identifying a sparse, interpretable network without thresholding, improving real-world performance. DEKER-NET uses a novel machine learning method for feature selection within an iterative framework for network inference. It is highly flexible: it handles linear and nonlinear relations, makes no assumptions about the underlying distribution of the data, and is suitable for categorical or continuous variables. We test our method on the Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge data and demonstrate that it can directly identify sparse, interpretable networks without thresholding while maintaining performance comparable to the hypothetical best-case thresholded network of other methods.
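The thresholding step that the abstract describes for existing ranking-based methods can be illustrated with a minimal sketch. This is not DEKER-NET and not any published method's code; the function names, edge scores, and reference network below are all hypothetical toy values, chosen only to show how the resulting network depends on the cutoff.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def threshold_edges(scores: Dict[Edge, float], cutoff: float) -> List[Edge]:
    """Keep edges whose score is at least `cutoff`, highest score first."""
    kept = [edge for edge, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda edge: scores[edge], reverse=True)


def precision_recall(predicted: List[Edge], truth: Set[Edge]) -> Tuple[float, float]:
    """Precision and recall of a predicted edge list against a reference edge set."""
    true_positives = sum(1 for edge in predicted if edge in truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical dense ranking: every possible edge among four features gets a score.
toy_scores: Dict[Edge, float] = {
    ("g1", "g2"): 0.90,
    ("g2", "g3"): 0.75,
    ("g1", "g3"): 0.40,
    ("g3", "g4"): 0.35,
    ("g1", "g4"): 0.10,
    ("g2", "g4"): 0.05,
}

# Hypothetical reference network, which is unknown in a real analysis.
toy_truth: Set[Edge] = {("g1", "g2"), ("g2", "g3"), ("g3", "g4")}

# Each cutoff yields a different network, with a different precision/recall trade-off.
for cutoff in (0.8, 0.5, 0.3, 0.0):
    network = threshold_edges(toy_scores, cutoff)
    precision, recall = precision_recall(network, toy_truth)
    print(f"cutoff={cutoff:.1f} edges={len(network)} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting the best cutoff can only be picked by consulting the reference network, which is not available for real data. That is the sense in which a "best-case thresholded network" is hypothetical.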
