Pseudo-value regression trees

Alina Schenk¹, Moritz Berger¹, Matthias Schmid¹
¹Institute of Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Bonn, Germany

Abstract

This paper presents a semi-parametric modeling technique for estimating the survival function from a set of right-censored time-to-event data. Our method, named pseudo-value regression trees (PRT), is based on the pseudo-value regression framework, which models individual-specific survival probabilities by computing pseudo-values and relating them to a set of covariates. The standard approach to pseudo-value regression is to fit a main-effects model using generalized estimating equations (GEE). PRT extend this approach by building a multivariate regression tree with pseudo-value outcome and by successively fitting a set of regularized additive models to the data in the nodes of the tree. By combining tree learning with additive modeling, PRT can perform variable selection and identify relevant interactions between the covariates, thereby addressing several limitations of the standard GEE approach. In addition, PRT include time-dependent effects in the node-wise models. Interpretability of the PRT fits is ensured by controlling the tree depth. In two simulation studies, we investigate the properties of the PRT method and compare it to several alternative modeling techniques. Furthermore, we illustrate PRT by analyzing survival in 3,652 patients enrolled in a randomized study on primary invasive breast cancer.
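The pseudo-value step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common jackknife construction based on the Kaplan-Meier estimator, in which the pseudo-value of individual i at time t is n·Ŝ(t) − (n−1)·Ŝ₍₋ᵢ₎(t), with Ŝ₍₋ᵢ₎ the estimate computed without individual i. The function names `km_survival` and `pseudo_values` are hypothetical and chosen for illustration only; this is not the authors' implementation and does not cover the tree-building or node-wise modeling steps of PRT.

```python
import numpy as np

def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) from right-censored data.

    times  : observed times (event or censoring)
    events : 1/True if the event was observed, 0/False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Multiply the conditional survival factors over distinct event times <= t.
    for u in np.unique(times[events]):  # np.unique returns sorted values
        if u > t:
            break
        at_risk = np.sum(times >= u)
        died = np.sum((times == u) & events)
        surv *= 1.0 - died / at_risk
    return surv

def pseudo_values(times, events, t):
    """Jackknife pseudo-values n*S(t) - (n-1)*S_{-i}(t), one per individual."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(times)
    full = km_survival(times, events, t)
    keep = np.ones(n, dtype=bool)
    out = np.empty(n)
    for i in range(n):
        keep[i] = False  # leave individual i out
        loo = km_survival(times[keep], events[keep], t)
        out[i] = n * full - (n - 1) * loo
        keep[i] = True
    return out

# Toy data (assumed for illustration): four individuals, no censoring.
pv = pseudo_values([1, 2, 3, 4], [1, 1, 1, 1], t=2.5)
print(pv)
```

Without censoring, the pseudo-values reduce to the indicators of surviving beyond t, which makes the toy data a convenient sanity check. With censoring they are no longer binary, and in pseudo-value regression they serve as the outcome that is related to the covariates.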
