Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability
Abstract
Large external data sources may be available to augment studies that collect data to address a specific research objective. In this article we consider the problem of building regression models for prediction based on individual-level data from an “internal” study while incorporating summary information from an “external” big data source. We extend the work of Chatterjee et al. (J Am Stat Assoc 111(513):107–117, 2016a) by introducing an adaptive empirical Bayes shrinkage estimator that uses the external summary-level information and the internal data to trade off bias against variance, protecting against differences between the two populations in the conditional probability distribution of the outcome given a set of covariates. We use simulation studies and a real data application using external summary information from the Prostate Cancer Prevention Trial to assess the performance of the proposed methods in contrast to maximum likelihood estimation and the constrained maximum likelihood (CML) method developed by Chatterjee et al. (2016a). Our simulation studies show that the CML method can be biased and inefficient when the assumption of a transportable covariate distribution between the external and internal populations is violated, and our empirical Bayes estimator provides protection against bias and loss of efficiency.
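The bias-variance trade-off behind such a shrinkage estimator can be illustrated for a single regression coefficient. The sketch below follows the empirical Bayes-type weighting of Mukherjee and Chatterjee (2008); it is an assumption made here for illustration and not necessarily the exact estimator proposed in this article. The function name `eb_shrink` and its arguments are hypothetical.

```python
def eb_shrink(beta_mle: float, beta_cml: float, var_mle: float) -> float:
    """Scalar empirical Bayes-type shrinkage between two estimates.

    beta_mle : internal-data-only estimate (robust, but more variable)
    beta_cml : estimate that also uses external summary information
               (more efficient, but biased if transportability fails)
    var_mle  : estimated variance of beta_mle

    Illustrative assumption: the weight on beta_mle is
    psi^2 / (var_mle + psi^2), with psi = beta_mle - beta_cml.
    """
    if var_mle <= 0:
        raise ValueError("var_mle must be positive")
    psi = beta_mle - beta_cml           # observed discrepancy
    tau2 = psi ** 2                     # crude estimate of squared bias
    weight_mle = tau2 / (var_mle + tau2)
    return weight_mle * beta_mle + (1.0 - weight_mle) * beta_cml
```

When the two estimates agree closely relative to the sampling variance, the result stays near the CML estimate; when they disagree strongly, it moves toward the internal-data MLE.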
References
Breslow NE, Holubkov R (1997) Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J R Stat Soc 59(2):447–461. https://doi.org/10.1111/1467-9868.00078
Chatterjee N, Chen YH, Maas P, Carroll RJ (2016a) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J Am Stat Assoc 111(513):107–117. https://doi.org/10.1080/01621459.2015.1123157
Chatterjee N, Chen YH, Maas P, Carroll RJ (2016b) Rejoinder. J Am Stat Assoc 111(513):130–131. https://doi.org/10.1080/01621459.2016.1149407
Chen YH, Chen H (2000) A unified approach to regression analysis under double-sampling designs. J R Stat Soc 62(3):449–460. https://doi.org/10.1111/1467-9868.00243
Deville JC, Särndal CE (1992) Calibration estimators in survey sampling. J Am Stat Assoc 87(418):376–382. https://doi.org/10.1080/01621459.1992.10475217
Grill S, Ankerst DP, Gail MH, Chatterjee N, Pfeiffer RM (2017) Comparison of approaches for incorporating new information into existing risk prediction models. Stat Med 36(7):1134–1156
Han P, Lawless JF (2016) Comment. J Am Stat Assoc 111(513):118–121. https://doi.org/10.1080/01621459.2016.1149399
Haneuse S, Rivera C (2016) Comment. J Am Stat Assoc 111(513):121–122. https://doi.org/10.1080/01621459.2016.1149401
Lawless JF, Kalbfleisch JD, Wild CJ (1999) Semiparametric methods for response-selective and missing data problems in regression. J R Stat Soc 61(2):413–438
Louis TA, Keiding N (2016) Comment. J Am Stat Assoc 111(513):123–124. https://doi.org/10.1080/01621459.2016.1149403
Lumley T, Shaw PA, Dai JY (2011) Connections between survey calibration estimators and semiparametric models for incomplete data. Int Stat Rev 79(2):200–220. https://doi.org/10.1111/j.1751-5823.2011.00138.x
Mefford JA, Zaitlen NA, Witte JS (2016) Comment: a human genetics perspective. J Am Stat Assoc 111(513):124–127. https://doi.org/10.1080/01621459.2016.1149404
Mukherjee B, Chatterjee N (2008) Exploiting gene-environment independence for analysis of case-control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64(3):685–694. https://doi.org/10.1111/j.1541-0420.2007.00953.x
Patel CJ, Dominici F (2016) Comment: addressing the need for portability in big data model building and calibration. J Am Stat Assoc 111(513):127–129. https://doi.org/10.1080/01621459.2016.1149406
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89(427):846–866. https://doi.org/10.1080/01621459.1994.10476818
Scott AJ, Wild CJ (1997) Fitting regression models to case-control data by maximum likelihood. Biometrika 84(1):57–71
Thompson IM, Ankerst DP, Chi C, Goodman PJ, Tangen CM, Lucia MS, Feng Z, Parnes HL, Coltman CA Jr (2006) Assessing prostate cancer risk: results from the prostate cancer prevention trial. J Natl Cancer Inst 98(8):529. https://doi.org/10.1093/jnci/djj131
Tomlins SA, Day JR, Lonigro RJ, Hovelson DH, Siddiqui J, Kunju LP, Dunn RL, Meyer S, Hodge P, Groskopf J et al (2016) Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. Eur Urol 70(1):45–53. https://doi.org/10.1016/j.eururo.2015.04.039
Wu C (2003) Optimal calibration estimators in survey sampling. Biometrika 90(4):937. https://doi.org/10.1093/biomet/90.4.937
Wu C, Sitter RR (2001) A model-calibration approach to using complete auxiliary information from survey data. J Am Stat Assoc 96(453):185–193. https://doi.org/10.1198/016214501750333054