State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues
Abstract
Keywords
References
Abrahamowicz M, du Berger R, Grover SA. Flexible modelling of the effects of serum cholesterol on coronary heart disease mortality. Am J Epidemiol. 1997;145:714–29.
Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med. 1989;8:771–83.
Altman DG, Lausen B, Sauerbrei W, Schumacher M. The dangers of using ‘optimal’ cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994;86:829–35.
Altman DG, McShane LM, Sauerbrei W, Taube SE. Reporting recommendations for tumor marker prognostic studies (REMARK): explanation and elaboration. PLoS Med. 2012;9:e1001216.
Antoniadis A, Gijbels I, Verhasselt A. Variable selection in additive models using P-splines. Technometrics. 2012;54:425–38.
Arem H, Moore SC, Patel A, Hartge P, Berrington de Gonzalez A, Visvanathan K, Campbell PT, Freedman M, Weiderpass E, Adami HO, Linet MS, Lee IM, Matthews CE. Leisure time physical activity and mortality: a detailed pooled analysis of the dose-response relationship. JAMA Intern Med. 2015;175:959–67.
Augustin N, Sauerbrei W, Schumacher M. The practical utility of incorporating model selection uncertainty into prognostic models for survival data. Stat Model. 2005;5:95–118.
Becher H. Analysis of continuous covariates and dose-effect analysis. In: Ahrens W, Pigeot I (Eds) Handbook of epidemiology. 2nd edition. Heidelberg: Springer Verlag; 2014.
Becher H, Lorenz E, Royston P, Sauerbrei W. Analysing covariates with spike at zero: a modified FP procedure and conceptual issues. Biometrical J. 2012;54:686–700.
Benedetti A, Abrahamowicz M. Using generalized additive models to reduce residual confounding. Stat Med. 2004;23:3781–801.
Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform. 2008;9:14.
Binder H, Sauerbrei W, Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Stat Med. 2013;32:2262–77.
Boulesteix AL, Binder H, Abrahamowicz M, Sauerbrei W. On the necessity and design of studies comparing statistical methods. Biometrical J. 2018;60:216–8.
Breiman L. The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. J Am Stat Assoc. 1992;87:738–54.
Buckland ST, Burnham KP, Augustin NH. Model selection: an integral part of inference. Biometrics. 1997;53:603–18.
Bühlmann P, Hothorn T. Boosting algorithms: regularization, prediction and model fitting. Stat Sci. 2007;22:477–505.
Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer; 2002.
Bursac Z, Gauss CH, Williams DK, Hosmer DW. Purposeful selection of variables in logistic regression. Source Code Biol Med. 2008;3:17.
Chatfield C. Model uncertainty, data mining and statistical inference (with discussion). J Royal Stat Soc Series A. 1995;158:419–66.
Chen C, George SL. The bootstrap and identification of prognostic factors via Cox’s proportional hazards regression model. Stat Med. 1985;4:39–46.
Chouldechova A, Hastie T. Generalized additive model selection. arXiv preprint 2015;arXiv:1506.03850.
Copas JB, Long T. Estimating the residual variance in orthogonal regression with variable selection. Journal of the Royal Statistical Society. Series D (The Statistician). 1991;40:51–59.
Cox DR. Comment on Breiman, L. (2001). Statistical modeling: the two cultures. Stat Sci. 2001;16:216–8.
Dakna M, Harris K, Kalousi A, Carpentier S, Kolch W, Schanstra JP, Haubitz M, Vlahou A, Mischak H, Girolami M. Addressing the challenge of defining valid proteomic biomarkers and classifiers. BMC Bioinform. 2010;11:594.
de Bin R, Sauerbrei W. Handling co-dependence issues in resampling-based variable selection procedures: a simulation study. J Stat Comput Simul. 2018;88:28–55.
de Bin R, Janitza S, Sauerbrei W, Boulesteix AL. Subsampling versus bootstrapping in resampling-based model selection for multivariable regression. Biometrics. 2016;72:272–80.
de Boor C. A practical guide to splines. Revised edition. New York: Springer; 2001.
Dorie V, Hill J, Shalit U, Scott M, Cervone D. Automated versus do-it-yourself methods for causal inference: lessons learned from a data analysis competition. Stat Sci. 2019;34:43–68.
Draper D. Assessment and propagation of model uncertainty (with discussion). J Royal Stat Soc Series B. 1995;57:45–97.
Dunkler D, Plischke M, Leffondré K, Heinze G. Augmented backward elimination: a pragmatic and purposeful way to develop statistical models. PLoS ONE. 2014;9:e113677.
Dunkler D, Sauerbrei W, Heinze G. Global, parameterwise and joint shrinkage factor estimation. J Stat Softw. 2016;69:1–19.
Efron B. Comment on Breiman, L. (2001). Statistical modeling: the two cultures. Stat Sci. 2001;16:218–9.
Efroymson MA. Multiple regression analysis. In: Ralston A, Wilf HS, editors. Mathematical methods for digital computers. New York: John Wiley; 1960.
Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties (with comments and rejoinder). Stat Sci. 1996;11:89–121.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–60.
Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning Theory. San Francisco, CA: Morgan Kaufmann Publishers Inc; 1996.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
Friedman JH, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat. 2000;28:337–407.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
Fröhlich H. Including network knowledge into Cox regression models for biomarker signature discovery. Biometrical J. 2014;56:287–306.
Gong G. Some ideas on using the bootstrap in assessing model variability. In: Heiner KW, Sacher RS, Wilkinson JW, editors. Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface. New York: Springer; 1982.
Good DM, Zürbig P, Argilés A, Bauer HW, Behrens G, Coon JJ, Dakna M, Decramer S, Delles C, Dominiczak AF, Ehrich JHH. Naturally occurring human urinary peptides for use in diagnosis of chronic kidney disease. Mol Cell Proteomic. 2010;9:2424–37.
Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6:450–4.
Groenwold RHH, Klungel OH, van der Graaf Y, Hoes AW, Moons KGM. Adjustment for continuous confounders: an example of how to prevent residual confounding. Can Med Assoc J. 2013;185:401–6.
Harrell FE. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. New York: Springer; 2001.
Harrell FE. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. 2nd ed. New York: Springer; 2015.
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–87.
Hastie T, Tibshirani R. Generalized additive models. New York: Chapman & Hall/CRC; 1990.
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York: Springer; 2009.
Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity: the lasso and generalizations. Monographs on statistics and applied probability. Boca Raton: CRC Press; 2015.
Heinze G, Wallisch C, Dunkler D. Variable selection – a review and recommendations for the practicing statistician. Biometrical J. 2018;60:431–49.
Hilsenbeck SG, Clark GM, McGuire WL. Why do so many prognostic factors fail to pan out? Breast Cancer Res Treat. 1992;22:197–206.
Hoerl AE, Kennard RW. Ridge Regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999;14:382–417.
Hofner B, Hothorn T, Kneib T, Schmid M. A framework for unbiased model selection based on boosting. J Comput Graphical Stat. 2011;20:956–71.
Huebner M, le Cessie S, Schmidt C, Vach W, On behalf of the Topic Group “Initial Data Analysis” of the STRATOS Initiative. A Contemporary Conceptual Framework for Initial Data Analysis. Observational Studies. 2018;4:171–92.
Janitza S, Binder H, Boulesteix AL. Pitfalls of hypothesis tests and model selection on bootstrap samples: causes and consequences in biometrical applications. Biometrical J. 2016;58:447–73.
Jenkner C, Lorenz E, Becher H, Sauerbrei W. Modeling continuous covariates with a ‘spike’ at zero: bivariate approaches. Biometrical J. 2016;58:783–96.
Lee PH. Is a cutoff of 10% appropriate for the change-in-estimate criterion of confounder identification? J Epidemiol. 2014;24:161–7.
Leeb H, Pötscher BM. Model selection and inference: facts and fiction. Econometric Theory. 2005;21:21–59.
Leffondre K, Abrahamowicz M, Siemiatycki J, Rachet B. Modeling smoking history: a comparison of different approaches. Am J Epidemiol. 2002;156:813–23.
Lin Y, Zhang HH. Component selection and smoothing in multivariate nonparametric regression. Ann Stat. 2006;34:2272–97.
Lorenz E, Jenkner C, Sauerbrei W, Becher H. Modeling variables with a spike at zero. Examples and practical recommendations. Am J Epidemiol. 2017;185:1–39.
Maldonado G, Greenland S. Simulation of confounder-selection strategies. Am J Epidemiol. 1993;138:923–36.
Mallows CL. The zeroth problem. Am Stat. 1998;52:1–9.
Marcus R, Peritz E, Gabriel KR. On closed test procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–60.
Marra G, Wood SN. Practical variable selection for generalized additive models. Comput Stat Data Anal. 2011;55:2372–87.
Mayr A, Binder H, Gefeller O, Schmid M. The Evolution of boosting algorithms – from machine learning to statistical modelling. Methods Inf Med. 2014;53:419–27.
Miller A. Selection of subsets of regression variables. Journal of the Royal Statistical Society. Series A (General). 1984;147:389–425.
Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015;162:W1–W73.
Morris T, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38:2074–102.
Nkuipou-Kenfack E, Zürbig P, Mischak H. The long path towards implementation of clinical proteomics: exemplified based on CKD273. Proteomics Clin Appl. 2017;11:5–6.
Perperoglou A, Sauerbrei W, Abrahamowicz M, Schmid M, on behalf of TG2 of the STRATOS initiative. A review of spline function procedures in R. BMC Med Res Methodol. 2019;19:46.
Pullenayegum EM, Platt RW, Barwick M, Feldman BM, Offringa M, Thabane L. Knowledge translation in biostatistics: a survey of current practices, preferences, and barriers to the dissemination and uptake of new statistical methods. Stat Med. 2015;35:805–18.
Ramaiola I, Padró T, Peña E, Juan-Babot O, Cubedo J, Martin-Yuste V, Sabate M, Badimon L. Changes in thrombus composition and profilin-1 release in acute myocardial infarction. Eur Heart J. 2015;36:965–75.
Ravikumar P, Liu H, Lafferty J, Wasserman L. SpAM: sparse additive models. In: Platt J, Koller D, Singer Y, Roweis S, editors. Advances in Neural Information Processing Systems. Vol. 20. Cambridge: MIT Press; 2008.
Rosenberg PS, Katki H, Swanson CA, Brown LM, Wacholder S, Hoover RN. Quantifying epidemiologic risk factors using non-parametric regression: model selection remains the greatest challenge. Stat Med. 2003;22:3369–81.
Rospleszcz S, Janitza S, Boulesteix AL. Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models. Biometrical J. 2016;58:652–73.
Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat. 1994;43:429–67.
Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127–41.
Royston P, Sauerbrei W. Multivariable modelling with cubic regression splines: a principled approach. Stata J. 2007;7:45–70.
Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Chichester: Wiley; 2008.
Sauerbrei W. The use of resampling methods to simplify regression models in medical statistics. Appl Stat. 1999;48:313–29.
Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, on behalf of the STRATOS initiative. STRengthening Analytical Thinking for Observational Studies: The STRATOS initiative. Stat Med. 2014;33:5413–32.
Sauerbrei W, Buchholz A, Boulesteix AL, Binder H. On stability issues in deriving multivariable regression models. Biometrical J. 2015;57:531–55.
Sauerbrei W, Meier-Hirmer C, Benner A, Royston P. Multivariable regression model building by using fractional polynomials: description of SAS, STATA and R programs. Comput Stat Data Anal. 2006;50:3464–85.
Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J Royal Stat Soc A. 1999;162:71–94.
Sauerbrei W, Royston P, Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat Med. 2007;26:5512–28.
Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med. 1992;11:2093–109.
Schmid M, Hothorn T. Boosting additive models using componentwise P-splines. Comput Stat Data Anal. 2008;53:298–311.
Shaw PA, Deffner V, Dodd KW, Freedman LS, Keogh R, Kipnis V, Küchenhoff H, Tooze JA, on behalf of Measurement Error Working group (TG4) of the STRATOS initiative. Epidemiological analyses with error prone exposures: review of current practice and recommendations. Ann Epidemiol. 2018;28:821–828.
Smith GCS, Seaman SR, Wood AM, Royston P, White IR. Correcting for Optimistic Prediction in Small Data Sets. Am J Epidemiol. 2014;180:318–24.
Steiner M, Kim Y. The Mechanics of omitted variable bias: bias amplification and cancellation of offsetting biases. J Causal Inference. 2016;4:20160009.
Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol. 1996;49:907–16.
Taylor J, Tibshirani RJ. Statistical learning and selective inference. Proc Natl Acad Sci USA. 2015;112:7629–34.
Teräsvirta T, Mellin I. Model selection criteria and model selection tests in regression models. Scand J Stat. 1986;13:159–71.
Tibshirani R. Regression shrinkage and selection via the Lasso. J Royal Stat Soc Series B Methodol. 1996;58:267–88.
Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J Royal Stat Soc Series B. 2011;73:273–82.
Tibshirani R, Taylor J, Loftus J, Reid S. Selective inference: tools for selective inference. Proc Natl Acad Sci USA. 2017;112:7629–34.
Tutz G, Binder H. Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics. 2006;62:961–71.
van Houwelingen HC. From model building to validation and back: a plea for robustness. Stat Med. 2014;33:5223–38.
van Houwelingen HC, Sauerbrei W. Cross-validation, shrinkage and variable selection in linear regression revisited. Open J Stat. 2013;3:79–102.
van Walraven C, Hart RG. Leave ‘em alone - why continuous variables should be analyzed as such. Neuroepidemiology. 2008;30:138–9.
Vandenbroucke JP, von Elm E, Altman DG, Gotzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration. Epidemiology. 2007;18:805–35.
Vickers AJ, Lilja H. Cutpoints in clinical chemistry: time for fundamental reassessment. Clin Chem. 2009;55:15–7.
White H. Using least squares to approximate unknown regression functions. Int Econ Rev. 1980a;21:149–70.
White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980b;48:817–38.
Wikimedia Foundation Inc; 2019. Statistical model. URL https://en.wikipedia.org/wiki/State_of_the_art. Accessed 1 July 2019.
Winter C, Kristiansen G, Kersting S, Roy J, Aust D, Knösel T, Rümmele P, Jahnke B, Hentrich V, Rückert F, Niedergethmann M, Weichert W, Bahra M, Schlitt HJ, Settmacher U, Friess H, Büchler M, Saeger H-D, Schroeder M, Pilarsky C, Grützmann R. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLOS Comput Biol. 2012;8:e1002511.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Stat Soc Series B (Methodological). 2005;67:301–20.