Emergent linguistic structure in artificial neural networks trained by self-supervision
Abstract
This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand-labeled for this latent structure. However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. We develop methods for identifying linguistic hierarchical structure emergent in artificial neural networks and demonstrate that components in these models focus on syntactic grammatical relationships and anaphoric coreference. Indeed, we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.
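The abstract's claim that a linear transformation of learned embeddings captures parse tree distances can be made concrete with a small sketch of a structural-probe-style computation. The Python fragment below is illustrative only: the probe matrix B is assumed to have been fitted already (e.g., by regression against gold tree distances), and the function names probe_distances and recover_tree, as well as the minimum-spanning-tree decoding step, are our own hypothetical choices rather than code from the paper.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def probe_distances(H, B):
    """Squared L2 distance between every pair of word vectors after the
    learned linear map B; this is the quantity used to approximate
    parse-tree distance.

    H: (n_words, d) contextual embeddings for one sentence
    B: (k, d) probe matrix, assumed already fitted against gold tree distances
    """
    T = H @ B.T                              # project embeddings into probe space
    diff = T[:, None, :] - T[None, :, :]     # pairwise differences
    return (diff ** 2).sum(axis=-1)          # (n_words, n_words) predicted distances


def recover_tree(dist):
    """One simple decoding choice: take the minimum spanning tree of the
    predicted distance matrix as the (undirected) parse tree."""
    mst = minimum_spanning_tree(dist)        # scipy returns a sparse edge matrix
    return list(zip(*mst.nonzero()))         # edges as (i, j) word-index pairs
```

Under these assumptions, running recover_tree(probe_distances(H, B)) on a sentence's embeddings yields an undirected tree whose edges can then be compared against the gold dependency tree, which is the kind of approximate reconstruction the abstract describes.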