Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database
Abstract
Although many present-day studies repeatedly gather data of a diverse nature (numeric quantities, binary indicators or ordered categories) on the same units over time, only a limited number of approaches exist in the literature for analysing such mixed-type longitudinal data. We present a statistical model capable of jointly modelling several mixed-type outcomes while accounting for possible dependencies among them. A thresholding approach links binary or ordinal variables to their latent numeric counterparts, which allows us to model all numeric outcomes, including the latent ones, jointly using a multivariate version of the linear mixed-effects model. We avoid the assumption of independence among outcomes by allowing the variance matrix of the random effects to be a completely general positive definite matrix. Moreover, we follow model-based clustering methodology and create a mixture of such models to capture heterogeneity in the temporal evolution of the considered outcomes. Estimation of this hierarchical model follows Bayesian principles and uses Markov chain Monte Carlo methods. A simulation study examined the ability of the method to estimate the true parameter values consistently and thus to discover the different patterns. Following its success, we analysed the EU-SILC dataset of Czech households that were followed for 4 years within the period from 2005 to 2016. Based on the estimated classification probabilities, the households were classified into groups with a similar evolution of several closely related indicators of monetary poverty.
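The thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single outcome, a random intercept only, and fixed cut-points, and it leaves out the multivariate random effects, the mixture components and the MCMC estimation. All names (`threshold`, `simulate_unit`, the parameter names) are hypothetical and chosen for illustration.

```python
import bisect
import random


def threshold(latent, cutpoints):
    """Map a latent numeric value to an ordinal category 0..len(cutpoints).

    Category k is observed when cutpoints[k-1] < latent <= cutpoints[k].
    With a single cut-point this reduces to a binary indicator.
    """
    return bisect.bisect_left(cutpoints, latent)


def simulate_unit(times, beta0, beta1, sd_intercept, sd_noise, cutpoints, rng):
    """Simulate one unit: latent linear trajectory plus observed categories.

    Assumption for illustration: a random intercept shared by all visits
    of the unit, and independent Gaussian noise at each visit.
    """
    b = rng.gauss(0.0, sd_intercept)  # unit-specific random intercept
    latent = [beta0 + b + beta1 * t + rng.gauss(0.0, sd_noise) for t in times]
    observed = [threshold(y, cutpoints) for y in latent]
    return latent, observed


if __name__ == "__main__":
    rng = random.Random(1)
    latent, observed = simulate_unit(
        times=[0, 1, 2, 3], beta0=-0.5, beta1=0.4,
        sd_intercept=1.0, sd_noise=0.5, cutpoints=[-1.0, 1.0], rng=rng,
    )
    for y, k in zip(latent, observed):
        print(f"latent={y:+.2f} -> category {k}")
```

In a model of this kind only the categories are observed. The latent values are treated as unknowns to be sampled along with the other parameters, which is what allows binary and ordinal outcomes to enter the same linear mixed-effects framework as the numeric ones.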