Calibrated imputation for multivariate categorical data
Abstract
Non-response is a major problem for anyone collecting and processing data. A commonly used technique for dealing with missing data is imputation, in which missing values are estimated and filled into the dataset. Imputation becomes challenging when the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In this paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputing a certain category for a variable in a certain unit yields the correct value for that variable and unit.
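To make the three ingredients named in the abstract concrete (imputation probabilities, known totals, logical restrictions), the following is a minimal sketch and not the algorithm developed in the paper. It assumes a single categorical variable, a tiny number of units with missing values, and a deterministic choice of the most probable joint imputation found by brute-force enumeration. The function name `calibrated_imputation`, its arguments, and the example data are hypothetical and chosen for illustration only.

```python
from itertools import product
from math import prod


def calibrated_imputation(probs, totals, allowed):
    """Return the most probable joint imputation that meets the totals.

    probs[i][c]   -- estimated probability that category c is correct for unit i
    totals[c]     -- number of imputed units that must receive category c
    allowed[i][c] -- False when a logical restriction forbids category c for unit i

    Illustrative brute force (exponential in the number of units);
    returns None when no imputation satisfies all constraints.
    """
    n_units = len(probs)
    categories = range(len(totals))
    best, best_score = None, -1.0
    for candidate in product(categories, repeat=n_units):
        # Calibration: category c must be imputed exactly totals[c] times.
        if any(candidate.count(c) != totals[c] for c in categories):
            continue
        # Logical restrictions: skip candidates that use a forbidden category.
        if not all(allowed[i][c] for i, c in enumerate(candidate)):
            continue
        score = prod(probs[i][c] for i, c in enumerate(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical data: categories 0 = "unmarried", 1 = "married".
probs = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
totals = [2, 1]  # assumed known totals for the three units to impute
# Assumed restriction: unit 0 is a child and therefore cannot be "married".
allowed = [[True, False], [True, True], [True, True]]

imputed = calibrated_imputation(probs, totals, allowed)
```

In this toy case the model on its own would favour "married" for unit 0, but the restriction rules that out, and the total of one "married" unit is then assigned to unit 2, the more likely of the two remaining units. A realistic implementation would need a scalable and typically randomised procedure that handles several variables jointly, which this sketch does not attempt.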