Sample size for binary logistic prediction models: Beyond events per variable criteria

Statistical Methods in Medical Research - Volume 28, Issue 8 - Pages 2455-2474 - 2019
Maarten van Smeden (1), Karel G. M. Moons (1), Joris A. H. de Groot (1), Gary S. Collins (2), Douglas G. Altman (2), Marinus J. C. Eijkemans (1), Johannes B. Reitsma (1)
(1) Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
(2) Centre for Statistics in Medicine, Botnar Research Centre, University of Oxford, Oxford, UK

Abstract

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an events per variable (EPV) criterion, notably EPV ≥ 10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study of how EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects influence the out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of the developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance is better approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and we provide suggestions for improving sample size determination.
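To make the EPV ≥ 10 rule concrete, the following minimal sketch computes EPV and the largest number of candidate predictors the rule would allow. It assumes the common convention that "events" means the count in the smaller outcome group; the function names and the two example datasets are hypothetical and are not taken from the paper.

```python
def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV, taking 'events' as the smaller outcome group (assumed convention)."""
    return min(n_events, n_nonevents) / n_predictors


def max_candidate_predictors(n_events, n_nonevents, min_epv=10):
    """Largest number of candidate predictors allowed under an EPV >= min_epv rule."""
    return min(n_events, n_nonevents) // min_epv


def describe(n_events, n_nonevents, n_predictors):
    """Return EPV together with the three parameters the abstract highlights."""
    n_total = n_events + n_nonevents
    return {
        "epv": events_per_variable(n_events, n_nonevents, n_predictors),
        "n_predictors": n_predictors,
        "n_total": n_total,
        "events_fraction": n_events / n_total,
    }


# Two hypothetical development datasets with the same EPV.
rare_outcome = describe(n_events=100, n_nonevents=900, n_predictors=10)
balanced_outcome = describe(n_events=100, n_nonevents=100, n_predictors=10)

print(rare_outcome)
print(balanced_outcome)
```

Both hypothetical datasets satisfy EPV = 10, yet they differ fivefold in total sample size and in events fraction. EPV alone cannot distinguish such settings, which is why the abstract argues for criteria based on the number of predictors, the total sample size and the events fraction.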

