Pseudo-value regression trees
Abstract
This paper presents a semi-parametric modeling technique for estimating the survival function from a set of right-censored time-to-event data. Our method, named pseudo-value regression trees (PRT), is based on the pseudo-value regression framework, which models individual-specific survival probabilities by computing pseudo-values and relating them to a set of covariates. The standard approach to pseudo-value regression is to fit a main-effects model using generalized estimating equations (GEE). PRT extend this approach by building a multivariate regression tree with a pseudo-value outcome and by successively fitting a set of regularized additive models to the data in the nodes of the tree. By combining tree learning with additive modeling, PRT are able to perform variable selection and to identify relevant interactions between the covariates, thereby addressing several limitations of the standard GEE approach. In addition, PRT include time-dependent effects in the node-wise models. Interpretability of the PRT fits is ensured by controlling the tree depth. Based on the results of two simulation studies, we investigate the properties of the PRT method and compare it to several alternative modeling techniques. Furthermore, we illustrate PRT by analyzing survival in 3,652 patients enrolled in a randomized study on primary invasive breast cancer.
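To make the pseudo-value step concrete, the following is a minimal sketch of how jackknife pseudo-values for the survival probability at a single time point could be computed from the Kaplan-Meier estimator. It is not the authors' implementation and does not cover the tree-building or additive-modeling parts of PRT. The function names `kaplan_meier` and `pseudo_values` and the toy data are assumptions made for illustration. The sketch uses the usual leave-one-out definition, in which the pseudo-value of individual i at time t is n times the full-sample estimate minus (n - 1) times the estimate obtained without individual i.

```python
from typing import List, Sequence


def kaplan_meier(times: Sequence[float], events: Sequence[int], t: float) -> float:
    """Kaplan-Meier estimate of S(t); events[i] is 1 for an event, 0 for censoring."""
    surv = 1.0
    # Distinct observed event times up to and including t, in increasing order.
    event_times = sorted({u for u, d in zip(times, events) if d == 1 and u <= t})
    for u in event_times:
        at_risk = sum(1 for v in times if v >= u)
        n_events = sum(1 for v, d in zip(times, events) if v == u and d == 1)
        surv *= 1.0 - n_events / at_risk
    return surv


def pseudo_values(times: Sequence[float], events: Sequence[int], t: float) -> List[float]:
    """Jackknife pseudo-values for S(t): n * S_hat(t) - (n - 1) * S_hat_without_i(t)."""
    n = len(times)
    full = kaplan_meier(times, events, t)
    result = []
    for i in range(n):
        # Leave-one-out sample without individual i.
        loo_times = [v for j, v in enumerate(times) if j != i]
        loo_events = [d for j, d in enumerate(events) if j != i]
        loo = kaplan_meier(loo_times, loo_events, t)
        result.append(n * full - (n - 1) * loo)
    return result


if __name__ == "__main__":
    # Hypothetical toy data: the second and fourth individuals are censored.
    toy_times = [1.0, 2.0, 3.0, 4.0, 5.0]
    toy_events = [1, 0, 1, 0, 1]
    print(pseudo_values(toy_times, toy_events, t=3.5))
```

In a pseudo-value regression, values of this kind (typically computed on a grid of time points) would serve as the outcome that is related to the covariates. Without censoring, each pseudo-value reduces to the indicator that the individual survived beyond t, which is a convenient sanity check.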