State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues

Diagnostic and Prognostic Research - Volume 4, Issue 1 - 2020
Willi Sauerbrei¹, Aris Perperoglou², Matthias Schmid³, Michal Abrahamowicz⁴, Heiko Becher⁵, Harald Binder⁶, Daniela Dunkler⁷, Frank E. Harrell⁸, Patrick Royston⁹, Georg Heinze⁷
¹ Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
² Data Science and Artificial Intelligence, AstraZeneca, Cambridge, UK
³ Department of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Bonn, Germany
⁴ McGill University Health Centre, McGill University, Montreal, Canada
⁵ Institute for Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
⁶ Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
⁷ Section for Clinical Biometrics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
⁸ Department of Biostatistics, School of Medicine, Vanderbilt University, Nashville, USA
⁹ MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, University College London, London, UK

Abstract

Background: How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. Many outstanding issues in multivariable modelling must be resolved before a state of the art can be defined and evidence-supported guidance can be given to researchers who have only a basic level of statistical knowledge. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics.

Methods: We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling.

Results: Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research.

Conclusions: Selection of variables and of functional forms are important topics in multivariable analysis. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, further comparative research is required.
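To make "choosing a functional form for a continuous variable" concrete, the sketch below illustrates one family of approaches, first-degree fractional polynomials (FP1). It is not taken from the article. It assumes the conventional FP power set, in which power 0 stands for log(x), and it simply picks the power with the smallest residual sum of squares in a linear model. Published FP procedures use deviance-based tests rather than this plain minimum, so treat it as an illustration only. The function names and the simulated data are hypothetical.

```python
import numpy as np

# Conventional FP power set (assumption of this sketch); power 0 denotes log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(x, power):
    """Return x**power, with the FP convention that power 0 means log(x)."""
    return np.log(x) if power == 0 else x ** power


def residual_sum_of_squares(x_transformed, y):
    """Fit y = b0 + b1 * x_transformed by least squares and return the RSS."""
    design = np.column_stack([np.ones_like(x_transformed), x_transformed])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)


def best_fp1_power(x, y, powers=FP_POWERS):
    """Return (power, rss) for the best-fitting FP1 transformation of positive x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(x <= 0):
        raise ValueError("FP transformations require strictly positive x")
    fits = [(p, residual_sum_of_squares(fp_transform(x, p), y)) for p in powers]
    # Choose the candidate power with the smallest RSS.
    return min(fits, key=lambda fit: fit[1])


# Simulated data (illustrative only): the true relationship is logarithmic.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.5, 10.0, size=300)
y_demo = 2.0 + 1.5 * np.log(x_demo) + rng.normal(scale=0.05, size=300)
demo_power, demo_rss = best_fp1_power(x_demo, y_demo)
```

Comparing `demo_power` with the power used to simulate the data shows whether the search recovers the underlying form. With noisier or smaller samples the selected power can vary from sample to sample, which is one of the stability concerns that motivate comparative studies of such methods.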
