Multiple imputation of discrete and continuous data by fully conditional specification

Statistical Methods in Medical Research - Volume 16, Issue 3 - Pages 219-242 - 2007
Stef van Buuren
TNO Quality of Life, Leiden, The Netherlands, and University of Utrecht, The Netherlands

Abstract

The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and include any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully conditional specification (FCS). JM is based on parametric statistical theory, and leads to imputation procedures whose statistical properties are known. JM is theoretically sound, but the joint model may lack flexibility needed to represent typical data features, potentially leading to bias. FCS is a semi-parametric and flexible alternative that specifies the multivariate model by a series of conditional models, one for each incomplete variable. FCS provides tremendous flexibility and is easy to apply, but its statistical properties are difficult to establish. Simulation work shows that FCS behaves very well in the cases studied. The present paper reviews and compares the approaches. JM and FCS were applied to pubertal development data of 3801 Dutch girls that had missing data on menarche (two categories), breast development (five categories) and pubic hair development (six stages). Imputations for these data were created under two models: a multivariate normal model with rounding and a conditionally specified discrete model. The JM approach introduced biases in the reference curves, whereas FCS did not. The paper concludes that FCS is a useful and easily applied flexible alternative to JM when no convenient and realistic joint distribution can be specified.
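To make the idea of "a series of conditional models, one for each incomplete variable" concrete, the following is a minimal sketch of an FCS loop on simulated toy data. It is not the procedure used in the paper. The variable names (`score`, `flag`), the helper functions, and the choice of a linear model for the continuous variable and a logistic model for the binary one are all assumptions made for illustration. The sketch also keeps the fitted parameters fixed within each step rather than drawing them from their posterior, so it understates the uncertainty that a proper multiple imputation procedure would carry.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares coefficients and residual standard deviation."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / max(len(y) - X.shape[1], 1))
    return beta, sigma

def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson (tiny ridge for stability)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + 1e-6 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

def fcs_impute(score, flag, n_iter=10, seed=0):
    """One completed data set from a two-variable FCS loop (hypothetical helper).

    score: continuous variable with NaN for missing values.
    flag:  binary (0/1) variable with NaN for missing values.
    """
    rng = np.random.default_rng(seed)
    score = score.astype(float).copy()
    flag = flag.astype(float).copy()
    miss_s, miss_f = np.isnan(score), np.isnan(flag)

    # Initialise missing entries with random draws from the observed values.
    score[miss_s] = rng.choice(score[~miss_s], miss_s.sum())
    flag[miss_f] = rng.choice(flag[~miss_f], miss_f.sum())
    ones = np.ones_like(score)

    for _ in range(n_iter):
        # Conditional model 1: score | flag (linear regression plus noise).
        X = np.column_stack([ones, flag])
        beta, sigma = fit_linear(X[~miss_s], score[~miss_s])
        score[miss_s] = X[miss_s] @ beta + rng.normal(0.0, sigma, miss_s.sum())

        # Conditional model 2: flag | score (logistic regression, Bernoulli draw).
        X = np.column_stack([ones, score])
        beta = fit_logistic(X[~miss_f], flag[~miss_f])
        flag[miss_f] = rng.binomial(1, sigmoid(X[miss_f] @ beta))

    return score, flag

# Toy data: a binary flag that depends on a continuous score, about 20% missing each.
rng = np.random.default_rng(1)
n = 500
score_true = rng.normal(0.0, 1.0, n)
flag_true = rng.binomial(1, sigmoid(1.5 * score_true)).astype(float)
score_obs, flag_obs = score_true.copy(), flag_true.copy()
score_obs[rng.random(n) < 0.2] = np.nan
flag_obs[rng.random(n) < 0.2] = np.nan

# Five completed data sets, one per seed.
completed = [fcs_impute(score_obs, flag_obs, seed=s) for s in range(5)]
```

Because the binary variable is imputed from its own conditional model, every imputed value is a valid category, whereas a multivariate normal model would need a rounding step to get there. Each completed data set would then be analysed separately and the results pooled.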
