Integrating semantic directions with concept mover’s distance to measure binary concept engagement
Abstract
In an earlier article published in this journal (“Concept Mover’s Distance”, 2019), we proposed a method for measuring concept engagement in texts that uses word embeddings to find the minimum cost necessary for words in an observed document to “travel” to words in a “pseudo-document” consisting only of words denoting a concept of interest. One potential limitation we noted is that, because words associated with opposing concepts will be located close to one another in the embedding space, documents will likely have similar closeness to starkly opposing concepts (e.g., “life” and “death”). Using aggregate vector differences between antonym pairs to extract a direction in the semantic space pointing toward a pole of the binary opposition (following “The Geometry of Culture,” American Sociological Review, 2019), we illustrate how CMD can be used to measure a document’s engagement with binary concepts.
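The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a tiny hand-made embedding table (all vectors are hypothetical), averages the vector differences of two antonym pairs to obtain a direction, and then scores a document by the mean cosine distance of its words to that direction. With a single-vector pseudo-document and uniform word weights, this mean equals the transport cost, since all mass must travel to the one target. The function names `semantic_direction` and `cost_to_direction` are illustrative only.

```python
import numpy as np

# Hypothetical 3-d toy embeddings; real analyses would use pretrained vectors.
emb = {
    "life":  np.array([0.9, 0.3, 0.1]),
    "death": np.array([-0.8, 0.4, 0.1]),
    "alive": np.array([0.8, 0.2, 0.2]),
    "dead":  np.array([-0.9, 0.3, 0.2]),
    "birth": np.array([0.7, 0.5, 0.0]),
    "grave": np.array([-0.7, 0.5, 0.1]),
}


def semantic_direction(pairs, emb):
    """Average the (pole_a - pole_b) differences and normalize to unit length."""
    diffs = [emb[a] - emb[b] for a, b in pairs]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def cost_to_direction(tokens, direction, emb):
    """Mean cosine distance from the document's known words to the direction.

    Assumes uniform word weights; out-of-vocabulary tokens are skipped.
    """
    known = [t for t in tokens if t in emb]
    return float(np.mean([cosine_distance(emb[t], direction) for t in known]))


# Direction pointing toward the "life" pole of the life/death opposition.
direction = semantic_direction([("life", "death"), ("alive", "dead")], emb)

doc_life = ["birth", "life"]
doc_death = ["grave", "death"]

cost_life = cost_to_direction(doc_life, direction, emb)
cost_death = cost_to_direction(doc_death, direction, emb)
print(cost_life < cost_death)  # the life-themed document sits closer to the pole
```

A lower cost indicates stronger engagement with the pole the direction points toward. Scaling and sign inversion of the cost, which would turn it into an engagement score, are omitted from this sketch.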
References
Arseniev-Koehler, A., & Foster, J. (2020). Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat. SocArXiv. https://osf.io/preprints/socarxiv/c9yj3/.
Atasu, K., Parnell, T., Dünner, C., Sifalakis, M., Pozidis, H., Vasileiadis, V., et al. (2017). Linear-complexity relaxed word mover's distance with GPU acceleration. In J.-Y. Nie, Z. Obradovic, T. Suzumura, R. Ghosh, R. Nambiar, C. Wang, et al. (Eds.), 2017 IEEE international conference on big data (pp. 889–896). Boston: IEEE.
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv. https://arxiv.org/abs/1606.06121.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 4349–4357). Curran Associates Inc.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.
Ethayarajh, K., Duvenaud, D., & Hirst, G. (2019). Understanding undesirable word embedding associations. arXiv. https://arxiv.org/abs/1908.06361.
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences of the United States of America, 115(16), E3635–E3644.
Goldberg, A. (2011). Mapping shared understandings using relational class analysis: The case of the cultural omnivore reexamined. American Journal of Sociology, 116(5), 1397–1436.
Kassambara, A. (2020). ggpubr: ‘ggplot2’ based publication ready plots. R package version 0.2.5. https://cran.r-project.org/web/packages/ggpubr/ggpubr.pdf. Accessed 11 June 2020.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905–949.
Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From word embeddings to document distances. In International conference on machine learning (pp. 957–966).
Lakoff, G. (2010). Moral politics: How liberals and conservatives think. Chicago: University of Chicago Press.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning (pp. 1558–1566). New York: ACM.
Makrai, M., Nemeskey, D., & Kornai, A. (2013). Applicative structure in vector space models. In A. Allauzen, H. Larochelle, C. Manning, & R. Socher (Eds.), Proceedings of the workshop on continuous vector space models and their compositionality (pp. 59–63). Sofia, Bulgaria: ACL.
Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 746–751). aclweb.org.
Project Gutenberg. (2020). https://www.gutenberg.org/wiki/Main_Page.
Rubner, Y., Tomasi, C., & Guibas, L. J. (1998). A metric for distributions with applications to image databases. In Sixth international conference on computer vision (IEEE Cat. No. 98CH36271) (pp. 59–66). IEEE.
Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20, 33–53.
Selivanov, D., Bickel, M., & Wang, Q. (2020). text2vec: Modern text mining framework for R. R package version 0.6. https://cran.r-project.org/web/packages/text2vec/text2vec.pdf. Accessed 11 June 2020.
Stoltz, D. S., & Taylor, M. A. (2019). Concept mover’s distance: measuring concept engagement via word embeddings in texts. Journal of Computational Social Science, 2(2), 293–313.
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer. (ISBN 0-387-95457-0).
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
Woolley, J. T., & Peters, G. (2008). The American presidency project, Santa Barbara. Available from: http://www.presidency.ucsb.edu/ws.