Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation
Abstract
Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging, especially under model uncertainty, and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive way to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, the reliability of double cross-validation under model uncertainty is controversial in the literature. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection. Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided. The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate. Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred.
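The separation between the inner loop (model selection) and the outer loop (error estimation) described in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the study: the variable selection step here simply ranks descriptors by absolute correlation with the response and fits ordinary least squares, the data are simulated, and all function names (`kfold_indices`, `fit_predict`, `cv_error`, `double_cv`) and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Split the indices 0..n-1 into k shuffled, near-equal folds."""
    return np.array_split(rng.permutation(n), k)

def fit_predict(X_tr, y_tr, X_te, n_vars):
    """Assumed selection rule: keep the n_vars descriptors most correlated
    with y on the training data only, fit OLS, and predict the test objects."""
    Xc = X_tr - X_tr.mean(axis=0)
    yc = y_tr - y_tr.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    keep = np.argsort(score)[::-1][:n_vars]
    A = np.column_stack([np.ones(len(y_tr)), X_tr[:, keep]])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    B = np.column_stack([np.ones(len(X_te)), X_te[:, keep]])
    return B @ coef

def cv_error(X, y, n_vars, folds):
    """Mean squared prediction error over the given folds for a fixed n_vars."""
    sq_err = np.empty(len(y))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred = fit_predict(X[train], y[train], X[test], n_vars)
        sq_err[test] = (y[test] - pred) ** 2
    return sq_err.mean()

def double_cv(X, y, candidates, k_outer=5, k_inner=5, seed=0):
    """Double (nested) cross-validation: returns the outer-loop estimate of
    the prediction error and the model size chosen in each outer split."""
    rng = np.random.default_rng(seed)
    sq_err = np.empty(len(y))
    chosen = []
    for test in kfold_indices(len(y), k_outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        # Inner loop: model selection sees only the outer training objects.
        inner_folds = kfold_indices(len(train), k_inner, rng)
        inner_pe = [cv_error(X[train], y[train], m, inner_folds) for m in candidates]
        best = candidates[int(np.argmin(inner_pe))]
        chosen.append(best)
        # Outer loop: the test objects took no part in building or selection.
        pred = fit_predict(X[train], y[train], X[test], best)
        sq_err[test] = (y[test] - pred) ** 2
    return sq_err.mean(), chosen

# Simulated data (illustrative): 3 relevant descriptors out of 20.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=80)

pe, chosen = double_cv(X, y, candidates=[1, 2, 3, 5, 10])
print("outer-loop prediction error (MSE):", pe)
print("model size chosen per outer split:", chosen)
```

In this sketch `k_inner` and the candidate list affect which model is selected, while `k_outer` sets the size of each outer test set, which mirrors the distinction the abstract draws between inner-loop and outer-loop parameters.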