Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering

Computational Statistics, Volume 38, pages 2015-2051, 2023
Massimo Bilancia1, Michele Di Nanni2, Fabio Manca3, Gianvito Pio4
1Department of Precision and Regenerative Medicine and Ionian Area (DiMePRe-J), University of Bari Aldo Moro, Policlinic University Hospital, Bari, Italy
2EY Business and Technology Solution, Bari, Italy
3Department of Education, Psychology, Communication (ForPsiCom), University of Bari Aldo Moro, Palazzo Chiaia - Napolitano, Bari, Italy
4Department of Computer Science, University of Bari "Aldo Moro", Bari, Italy

Abstract

In this paper, we formulate a hierarchical Bayesian version of the Mixture of Unigrams model for text clustering and approach its posterior inference through variational methods. We derive the explicit expression of the variational objective function for our hierarchical model under a mean-field approximation. We then obtain the update equations of a coordinate-ascent algorithm that finds local maxima of the variational target, and estimate the model parameters through the optimized variational hyperparameters. The advantages of variational algorithms over traditional Markov Chain Monte Carlo methods based on iterative posterior sampling are also discussed in detail.
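The coordinate-ascent scheme summarized above can be illustrated with a minimal sketch. The following Python code is a generic, hypothetical implementation of mean-field variational inference for a Bayesian mixture of unigrams with Dirichlet priors, not the paper's exact update equations: the function name, the symmetric hyperparameters `alpha0` and `beta0`, and the initialization strategy are all illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def cavi_mixture_of_unigrams(X, K, alpha0=1.0, beta0=1.0, n_iter=100, seed=0):
    """Coordinate-ascent variational inference (CAVI) for a Bayesian
    mixture-of-unigrams model -- an illustrative sketch, not the paper's
    exact derivation.  X is a (D, V) document-term count matrix."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    # Variational responsibilities q(z_d = k), randomly initialized.
    r = rng.dirichlet(np.ones(K), size=D)
    for _ in range(n_iter):
        # Update the variational Dirichlet over the mixture weights.
        alpha = alpha0 + r.sum(axis=0)                      # shape (K,)
        # Update the variational Dirichlets over cluster-word distributions.
        beta = beta0 + r.T @ X                              # shape (K, V)
        # Expected log-parameters under the current variational Dirichlets.
        Elog_pi = digamma(alpha) - digamma(alpha.sum())
        Elog_phi = digamma(beta) - digamma(beta.sum(axis=1, keepdims=True))
        # Responsibilities via a numerically stable log-sum-exp softmax.
        log_rho = Elog_pi + X @ Elog_phi.T                  # shape (D, K)
        r = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
    return r, alpha, beta
```

Each pass updates one factor of the mean-field approximation while holding the others fixed, which monotonically improves the variational objective; point estimates of the model parameters can then be read off the optimized variational hyperparameters (e.g., posterior means of the Dirichlets).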
