Joint dynamic topic model for recognition of lead-lag relationship in two text corpora
Abstract
Topic evolution modeling has received significant attention in recent decades. Although various topic evolution models have been proposed, most studies focus on a single document corpus. In practice, however, data are often available from multiple sources, and relationships between those sources can be observed. It is therefore of great interest to recognize the relationship between multiple text corpora and to use this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the “lead-lag relationship”. This relationship characterizes the phenomenon in which one text corpus influences the topics that will be discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a joint dynamic topic model and develop an embedding extension to address the problem of modeling large-scale text corpora. Once the lead-lag relationship is recognized, the similarities between the two text corpora can be identified and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the joint dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model to two text corpora consisting of statistical papers and graduation theses. The results show that the proposed model recognizes the lead-lag relationship between the two corpora well, and that it also discovers the specific and shared topic patterns in the two corpora.