Learning author-topic models from text corpora

ACM Transactions on Information Systems - Volume 28, Issue 1 - Pages 1-38 - 2010
Michal Rosen-Zvi [1], Chaitanya Chemudugunta [2], Thomas Griffiths [3], Padhraic Smyth [2], Mark Steyvers [2]
[1] IBM Research Laboratory in Haifa
[2] University of California, Irvine
[3] University of California, Berkeley

Abstract

We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented by a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with its authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model that allow, for example, generalizations of the notion of an author are also briefly discussed.
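The two-stage generative process described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `generate_document` and `document_topic_mixture`, the toy authors, topics, and probabilities are all hypothetical, and the sketch assumes that a multi-author document mixes its authors' topic distributions with equal weights, which the abstract does not specify. It covers only sampling from known distributions, not the Markov chain Monte Carlo learning step.

```python
import random

# Toy parameters (hypothetical, for illustration only).
# Each author is a probability distribution over topics.
AUTHOR_TOPIC = {
    "author_a": {"learning": 0.9, "retrieval": 0.1},
    "author_b": {"learning": 0.2, "retrieval": 0.8},
}
# Each topic is a probability distribution over words.
TOPIC_WORD = {
    "learning": {"model": 0.5, "bayesian": 0.3, "sampling": 0.2},
    "retrieval": {"query": 0.6, "document": 0.3, "ranking": 0.1},
}


def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Sample (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: each word is attributed to one of the document's
        # authors chosen uniformly at random (equal mixture weights).
        author = rng.choice(authors)
        # Stage 1: draw a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's word distribution.
        vocab, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(vocab, weights=word_probs, k=1)[0]
        tokens.append((author, topic, word))
    return tokens


def document_topic_mixture(authors, author_topic):
    """Equal-weight mixture of the authors' topic distributions."""
    mixture = {}
    for author in authors:
        for topic, prob in author_topic[author].items():
            mixture[topic] = mixture.get(topic, 0.0) + prob / len(authors)
    return mixture


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    doc = generate_document(
        ["author_a", "author_b"], AUTHOR_TOPIC, TOPIC_WORD, 10, rng
    )
    print(" ".join(word for _, _, word in doc))
    print(document_topic_mixture(["author_a", "author_b"], AUTHOR_TOPIC))
```

Under the equal-weight assumption, a paper co-written by the two toy authors has a topic distribution that is the average of their individual distributions, which is what `document_topic_mixture` computes.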
