Comprehensive analysis of embeddings and pre-training in NLP
References
Hinton, 2012, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Process. Mag., 29, 82, 10.1109/MSP.2012.2205597
Dahl, 2011, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio Speech Lang. Process., 20, 30, 10.1109/TASL.2011.2134090
J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M. Ranzato, A. Senior, P. Tucker, et al. Large scale distributed deep networks, in: Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1, 2012, pp. 1223–1231.
Krizhevsky, 2012, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst., 25, 1097
LeCun, 1998, Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278, 10.1109/5.726791
Senior, 2020, Improved protein structure prediction using potentials from deep learning, Nature, 577, 706, 10.1038/s41586-019-1923-7
Agarap, 2018
Lu, 2018
Huang, 2015
Gregor, 2015, DRAW: A recurrent neural network for image generation, 1462
Graves, 2005, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw., 18, 602, 10.1016/j.neunet.2005.06.042
Bahdanau, 2014
Sutskever, 2014, Sequence to sequence learning with neural networks, 3104
Cho, 2014
Luong, 2015
Vaswani, 2017, Attention is all you need, 5998
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
Ba, 2016
Pan, 2009, A survey on transfer learning, IEEE Trans. Knowl. Data Eng., 22, 1345, 10.1109/TKDE.2009.191
Simonyan, 2014
Mikolov, 2013
Mikolov, 2013, Distributed representations of words and phrases and their compositionality, 3111
M. Baroni, G. Dinu, G. Kruszewski, Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 238–247.
J. Pennington, R. Socher, C.D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
Deerwester, 1990, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci., 41, 391, 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
McCann, 2017
Peters, 2018
Weaver, 1949, Translation
Moro, 2014, Entity linking meets word sense disambiguation: A unified approach, Trans. Assoc. Comput. Linguist., 2, 231, 10.1162/tacl_a_00179
Jawahar, 2018, ELMoLex: Connecting ELMo and lexicon features for dependency parsing, 1
Hochreiter, 1991, Untersuchungen zu dynamischen neuronalen Netzen, Diploma Tech. Univ. München, 91
Hochreiter, 2001
Radford, 2018
Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27.
K. Papineni, S. Roukos, T. Ward, J. Henderson, F. Reeder, Corpus-based comprehensive and diagnostic MT evaluation: initial Arabic, Chinese, French, and Spanish results, in: Proceedings of the Second International Conference on Human Language Technology Research, 2002, pp. 132–137.
Liu, 2018
Rocktäschel, 2015
Radford, 2019, Language models are unsupervised multitask learners, OpenAI Blog, 1, 9
Zhu, 2018
Alberti, 2019
C. Qu, L. Yang, M. Qiu, W.B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding for conversational question answering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1133–1136.
Liu, 2019
Liu, 2019
Zhang, 2019
Sennrich, 2015
Brown, 2020
Kaplan, 2020
Devlin, 2018
Taylor, 1953, “Cloze procedure”: A new tool for measuring readability, J. Q., 30, 415
Wu, 2016
Liu, 2019
Trinh, 2018
S. Nagel, URL http://web.archive.org/save/http://commoncrawl.org/2016/10/newsdataset.
A. Gokaslan, V. Cohen, URL http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus.
Reimers, 2019
Sanh, 2019
C. Buciluǎ, R. Caruana, A. Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 535–541.
Hinton, 2015
Lan, 2019
Hou, 2020
L. Yang, M. Zhang, C. Li, M. Bendersky, M. Najork, Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1725–1734.
Fedus, 2021
He, 2021
Yang, 2019, XLNet: Generalized autoregressive pretraining for language understanding, Adv. Neural Inf. Process. Syst., 32
Raffel, 2020
Clark, 2020
Lee-Thorp, 2021
K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
C. Callison-Burch, M. Osborne, P. Koehn, Re-evaluating the role of BLEU in machine translation research, in: 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 249–256.
Rajpurkar, 2016
Lai, 2017
Zellers, 2018
Wang, 2018
McCann, 2018
Bengio, 1994, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw., 5, 157, 10.1109/72.279181
Tang, 2016, Sequence-to-sequence model with attention for time series classification, 503
Harmon, 2018
Chiu, 2018, State-of-the-art speech recognition with sequence-to-sequence models, 4774
Zhou, 2018, A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on Mandarin Chinese, 210
Mangal, 2019
Kotecha, 2018
F., 2020
