Calibrated imputation for multivariate categorical data

Ton de Waal (1, 2), Jacco Daalmans (1)
(1) Statistics Netherlands, The Hague, The Netherlands
(2) Tilburg University, Tilburg, The Netherlands

Abstract

Non-response is a major problem for anyone collecting and processing data. A commonly used technique for dealing with missing data is imputation, in which missing values are estimated and filled into the dataset. Imputation becomes challenging when the variable to be imputed has to comply with a known total. It is more challenging still when several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In this paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The approach can be combined with any imputation model that estimates imputation probabilities, i.e. the probability that imputing a certain category for a variable in a certain unit yields the correct value for that variable and unit.
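To make the setting concrete, the following is a minimal sketch of the problem for a single categorical variable. It is not the approach developed in the paper: it is a simple greedy heuristic, shown only to illustrate what it means to impute from estimated probabilities while matching known totals and respecting a logical restriction. The function name `calibrated_impute`, the `allowed` mask, and the toy data are all assumptions made for illustration.

```python
def calibrated_impute(probs, totals, allowed=None):
    """Greedy toy heuristic (illustrative only, not the paper's method).

    probs[i][c]   : estimated probability that category c is correct for unit i
    totals[c]     : number of imputed units that must receive category c
    allowed[i][c] : False if a logical restriction rules out c for unit i
    """
    n_units = len(probs)
    if sum(totals.values()) != n_units:
        raise ValueError("totals must sum to the number of units to impute")
    remaining = dict(totals)
    # All permitted (probability, unit, category) triples, most probable first.
    candidates = sorted(
        ((p, i, c)
         for i, row in enumerate(probs)
         for c, p in row.items()
         if allowed is None or allowed[i][c]),
        key=lambda t: -t[0],
    )
    imputed = {}
    for p, i, c in candidates:
        # Take a candidate only if the unit is still open and the
        # category has not yet reached its known total.
        if i not in imputed and remaining[c] > 0:
            imputed[i] = c
            remaining[c] -= 1
    if len(imputed) != n_units:
        # A greedy pass can get stuck even when a feasible assignment exists.
        raise RuntimeError("greedy pass found no feasible assignment")
    return [imputed[i] for i in range(n_units)]


# Hypothetical data: four units with missing marital status.
probs = [
    {"single": 0.40, "married": 0.60},  # unit 0 is a child
    {"single": 0.30, "married": 0.70},
    {"single": 0.80, "married": 0.20},
    {"single": 0.45, "married": 0.55},
]
# Logical restriction: a child cannot be married.
allowed = [
    {"single": True, "married": False},
    {"single": True, "married": True},
    {"single": True, "married": True},
    {"single": True, "married": True},
]
# Known totals for the units to be imputed.
totals = {"single": 2, "married": 2}

imputed = calibrated_impute(probs, totals, allowed)
print(imputed)
```

In this toy run, unit 0 would be imputed as "married" if only its probabilities were used, but the restriction forces "single", and the known totals then determine how the remaining categories are distributed over the other units. A greedy pass like this is deterministic and can fail on harder inputs, which is one reason a more careful method is needed in practice.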
