Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database

Advances in Data Analysis and Classification - Volume 17 - Pages 369-406 - 2022
Jan Vávra¹, Arnošt Komárek¹
¹Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Abstract

Although many present-day studies gather data of a diverse nature (numeric quantities, binary indicators or ordered categories) on the same units repeatedly over time, only a limited number of approaches exist in the literature for analysing such mixed-type longitudinal data. We present a statistical model capable of jointly modelling several mixed-type outcomes, which also accounts for possible dependencies among the investigated outcomes. A thresholding approach that links binary or ordinal variables to their latent numeric counterparts allows us to jointly model all numeric outcomes, including the latent ones, using a multivariate version of the linear mixed-effects model. We avoid the assumption of independence between outcomes by allowing the variance matrix of the random effects to be a completely general positive definite matrix. Moreover, we follow model-based clustering methodology and create a mixture of such models to capture heterogeneity in the temporal evolution of the considered outcomes. Estimation of this hierarchical model follows Bayesian principles and uses Markov chain Monte Carlo methods. Following a successful simulation study that examined the ability to consistently estimate the true parameter values and thus discover the different patterns, we analysed the EU-SILC dataset of Czech households that were followed for 4 years within the period from 2005 to 2016. Based on estimated classification probabilities, the households were classified into groups with a similar evolution of several closely related indicators of monetary poverty.
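The thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cutpoints, the time slope of 0.2 and the single random intercept per unit are all assumptions chosen for illustration. The sketch only shows how an ordinal outcome observed repeatedly on the same units could arise from a latent numeric variable that follows a simple linear mixed-effects structure.

```python
import numpy as np


def threshold_latent(latent, cutpoints):
    """Map latent numeric values to ordered categories 0..K-1.

    `cutpoints` holds K-1 strictly increasing thresholds (hypothetical values).
    Category k is assigned when cutpoints[k-1] < latent <= cutpoints[k].
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    if np.any(np.diff(cutpoints) <= 0):
        raise ValueError("cutpoints must be strictly increasing")
    return np.searchsorted(cutpoints, latent, side="left")


def simulate_ordinal_panel(n_units=5, n_times=4, cutpoints=(-0.5, 0.5), seed=0):
    """Simulate one ordinal outcome per unit and time point (illustrative only)."""
    rng = np.random.default_rng(seed)
    times = np.arange(n_times)
    # Unit-specific random intercept, shared by all visits of the same unit.
    intercepts = rng.normal(0.0, 1.0, size=n_units)
    # Independent measurement noise for every unit and visit.
    noise = rng.normal(0.0, 1.0, size=(n_units, n_times))
    # Latent numeric outcome: random intercept + assumed time trend + noise.
    latent = intercepts[:, None] + 0.2 * times[None, :] + noise
    return latent, threshold_latent(latent, cutpoints)


if __name__ == "__main__":
    latent, ordinal = simulate_ordinal_panel()
    print(ordinal)  # one row per unit, one column per visit
```

In a model of the kind the abstract describes, the latent values would be unobserved and sampled within the MCMC scheme; here they are generated directly only to show the link between the latent and observed scales.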
