Multi-stage transfer learning with BERTology-based language models for question answering system in Vietnamese
Abstract
With the rapid growth of information science and engineering, the large volumes of textual data being generated are valuable for natural language processing and its applications. In particular, finding correct answers to natural language questions or queries still demands considerable human time and effort. When using search engines to discover information, users must manually locate the answer to a given question across a range of retrieved texts or documents. Question answering relies heavily on the capability to automatically comprehend questions posed in human language and to extract meaningful answers from a single text. In recent years, question-answering systems built on machine reading comprehension techniques have become increasingly popular. However, most of this progress has occurred in high-resource languages (e.g., English and Chinese), whose question-answering methodologies draw on a variety of knowledge sources. Moreover, powerful BERTology-based language models can only encode texts of limited length, and longer texts contain more distractor sentences that degrade QA performance. Vietnamese also uses a variety of question words within the same question type. To address these challenges, we propose ViQAS, a new question-answering system with multi-stage transfer learning that uses BERTology-based language models for a low-resource language such as Vietnamese. Our system further integrates Vietnamese linguistic characteristics and transformer-based evidence extraction into an effective contextualized language-model-based QA system. As a result, the proposed system outperforms forty retriever-reader QA configurations of our own as well as seven state-of-the-art QA systems, namely DrQA, BERTserini, BERTBM25, XLMRQA, ORQA, COBERT, and NeuralQA, on three Vietnamese benchmark question-answering datasets.
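To make the retriever-reader setting concrete, the sketch below illustrates a generic two-stage pipeline of the kind referred to above: sparse BM25 retrieval over a small passage collection followed by an extractive transformer reader. It is an illustrative assumption rather than the ViQAS implementation; it relies on the third-party rank_bm25 and Hugging Face transformers packages, and the checkpoint deepset/xlm-roberta-base-squad2 and the toy passages are placeholders for the paper's fine-tuned BERTology readers and Vietnamese Wikipedia knowledge source.

```python
# Minimal retriever-reader sketch (not the ViQAS implementation): BM25 retrieves
# candidate passages, then an extractive reader built on a multilingual
# transformer checkpoint selects an answer span. Model name, corpus, and
# whitespace tokenization are simplifying placeholders.
from rank_bm25 import BM25Okapi
from transformers import pipeline

# Toy passage collection standing in for a Wikipedia-based knowledge source.
passages = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Thành phố Hồ Chí Minh là thành phố đông dân nhất Việt Nam.",
]

# Stage 1: sparse retrieval with BM25 over whitespace-tokenized passages.
bm25 = BM25Okapi([p.lower().split() for p in passages])

# Stage 2: extractive reading with a publicly available multilingual QA model.
reader = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

def answer(question: str, top_k: int = 3) -> dict:
    # Retrieve the top-k passages for the question, read each one,
    # and keep the answer span with the highest reader score.
    ranked = bm25.get_top_n(question.lower().split(), passages, n=top_k)
    candidates = [reader(question=question, context=ctx) for ctx in ranked]
    return max(candidates, key=lambda c: c["score"])

print(answer("Thủ đô của Việt Nam là gì?"))
```

In a multi-stage transfer-learning setup, the reader checkpoint would typically be adapted on large source QA data before being fine-tuned on the Vietnamese target dataset and plugged into such a pipeline.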
References
Alzubi JA, Jain R, Singh A, Parwekar P, Gupta M (2021) Cobert: covid-19 question answering system using bert. Arab J Sci Eng:1–11
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) Vqa: visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
Bach NX, Thanh PD, Oanh TT (2020) Question analysis towards a vietnamese question answering system in the education domain. Cybern Inform Technol 20(1):112–128
Bai Y, Wang DZ (2021) More than reading comprehension: A survey on datasets and metrics of textual question answering. arXiv preprint arXiv:2109.12264
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) Dbpedia—a crystallization point for the web of data. J Web Seman 7(3):154–165
Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp 1247–1250
Chen D, Bolton J, Manning CD (2016) A thorough examination of the cnn/daily mail reading comprehension task. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 2358–2367
Chen D, Fisch A, Weston J, Bordes A (2017) Reading wikipedia to answer open-domain questions. Proc ACL 2017:1870–1879
Chen D, Yih W-T (2020) Open-domain question answering. Proc ACL 2020:34–37
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. Proc ACL 2020:8440–8451
Cui Y, Liu T, Che W, Xiao L, Chen Z, Ma W, Wang S, Hu G (2019) A span-extraction dataset for chinese machine reading comprehension. Proc EMNLP-IJCNLP 2019:5883–5889
Das R, Dhuliawala S, Zaheer M, McCallum A (2018) Multi-step retriever-reader interaction for scalable open-domain question answering. In: ICLR
Devlin J, Chang M-W, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. Proc NAACL 2019:4171–4186
d’Hoffschmidt M, Belblidia W, Heinrich Q, Brendlé T, Vidal M (2020) FQuAD: french question answering dataset. In: EMNLP 2020 (Findings), pp 1193–1208, Online. Association for Computational Linguistics
Dibia V (2020) Neuralqa: a usable library for question answering (contextual query expansion+ bert) on large datasets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 15–22
Do PN-T, Nguyen ND, Van Huynh T, Van Nguyen K, Nguyen AG-T, Nguyen NL-T (2021) Sentence extraction-based machine reading comprehension for vietnamese. In: Knowledge Science, Engineering and Management: 14th International Conference, KSEM 2021, Tokyo, Japan, August 14-16, 2021, Proceedings, Part II, Lecture Notes in Computer Science, vol 12816, pp 511–523. Springer
Do P, Phan THV (2022) Developing a bert based triple classification model using knowledge graph embedding for question answering system. Appl Intell 52(1):636–651
Do P, Phan THV, Gupta BB (2021) Developing a vietnamese tourism question answering system using knowledge graph and deep learning. Trans Asian Low-Resour Lang Inform Process 20(5):1–18
Doan AL, Luu ST (2022) Improving sentiment analysis by emotion lexicon approach on vietnamese texts. arXiv preprint arXiv:2210.02063
Dua D, Wang Y, Dasigi P, Stanovsky G, Singh S, Gardner M (2019) Drop: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In: NAACL-HLT (1)
Duong H-T, Ho B-Q (2015) A vietnamese question answering system in vietnam's legal documents. In: IFIP International Conference on Computer Information Systems and Industrial Management, pp 186–197. Springer
Efimov P, Chertok A, Boytsov L, Braslavski P (2020) Sberquad–russian reading comprehension dataset: description and analysis. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp 3–15. Springer
Feldman Y, El-Yaniv R (2019) Multi-hop paragraph retrieval for open-domain question answering. Proc ACL 2019:2296–2309
Green Jr BF, Wolf AK, Chomsky C, Laughery K (1961) Baseball: an automatic question-answerer. In: Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference, pp 219–224
Guu K, Lee K, Tung Z, Pasupat P, Chang M (2020) Retrieval augmented language model pre-training. In: International Conference on Machine Learning, pp 3929–3938. PMLR
Harabagiu S, Moldovan D, Clark C, Bowden M, Williams J, Bensley J (2003) Answer mining by combining extraction techniques with abductive reasoning. Proc. TREC 2003:375–382
Harabagiu S, Pasca M, Maiorano SJ (2000) Experiments with open-domain textual question answering. In: COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics
Hedderich MA, Lange L, Adel H, Strötgen J, Klakow D (2021) A survey on recent approaches for natural language processing in low-resource scenarios. In: Proceedings of NAACL 2021, pp 2545–2568
Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, Blunsom P (2015) Teaching machines to read and comprehend. Adv Neural Inform Process Syst 28
Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of ACL 2018 (Volume 1: Long Papers), pp 328–339
Huang H-Y, Zhu C, Shen Y, Chen W (2018) Fusionnet: fusing via fully-aware attention with application to machine comprehension. In: ICLR 2018
Izacard G, Grave E (2021) Distilling knowledge from reader to retriever for question answering. In: ICLR 2021
Izacard G, Grave É (2021) Leveraging passage retrieval with generative models for open domain question answering. Proc EACL 2021:874–880
Kafle K, Kanan C (2017) Visual question answering: datasets, algorithms, and future challenges. Comput Vis Image Understand 163:3–20
Karpukhin V, Oguz B, Min S, Lewis P, Wu L, Edunov S, Chen D, Yih W-T (2020) Dense passage retrieval for open-domain question answering. Proc EMNLP 2020:6769–6781
Kratzwald B, Eigenmann A, Feuerriegel S (2019) Rankqa: neural question answering with answer re-ranking. Proc ACL 2019:6076–6085
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) Albert: a lite bert for self-supervised learning of language representations. In: ICLR 2019
Lee K, Chang M-W, Toutanova K (2019) Latent retrieval for weakly supervised open domain question answering. Proc ACL 2019:6086–6096
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, Lewis M, Yih W-t, Rocktäschel T et al (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv Neural Inform Process Syst 33:9459–9474
Lim S, Kim M, Lee J (2019) Korquad1.0: Korean qa dataset for machine reading comprehension. arXiv preprint arXiv:1909.07005
Lin J, Ma X, Lin S-C, Yang J-H, Pradeep R, Nogueira R (2021) Pyserini: a python toolkit for reproducible information retrieval research with sparse and dense representations. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 2356–2362
Lin Y, Ji H, Liu Z, Sun M (2018) Denoising distantly supervised open-domain question answering. Proc ACL 2018:1736–1745
Liu S, Zhang X, Zhang S, Wang H, Zhang W (2019) Neural machine reading comprehension: methods and trends. Appl Sci 9(18):3698
Messaoudi A, Haddad H, Ben Haj HM (2020) icompass at semeval-2020 task 12: from a syntax-ignorant n-gram embeddings model to a deep bidirectional language model. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp 1978–1982
Min S, Chen D, Zettlemoyer L, Hajishirzi H (2019) Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868
Nguyen DQ, Nguyen AT (2020) PhoBERT: pre-trained language models for Vietnamese. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp 1037–1042, Online. Association for Computational Linguistics
Nguyen K, Nguyen V, Nguyen A, Nguyen N (2020) A vietnamese dataset for evaluating machine reading comprehension. In: Proceedings of the 28th International Conference on Computational Linguistics, pp 2595–2605
Van Nguyen K, Do PN-T, Nguyen ND, Van Huynh T, Nguyen AG-T, Nguyen NL-T (2022) Xlmrqa: open-domain question answering on vietnamese wikipedia-based textual knowledge source. In: The 14th Asian Conference on Intelligent Information and Database Systems (accepted)
Nogueira R, Cho K (2019) Passage re-ranking with bert. arXiv preprint arXiv:1901.04085
Noraset T, Lowphansirikul L, Tuarob S (2021) Wabiqa: a wikipedia-based thai question-answering system. Inform Process Manag 58(1):102431
Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of NAACL 2018, Volume 1 (Long Papers), pp 2227–2237
Phan T, Do P (2021) Building a vietnamese question answering system based on knowledge graph and distributed cnn. Neural Comput Appl: 1–21
Pyysalo S, Kanerva J, Virtanen A, Ginter F (2021) Wikibert models: deep transfer learning for many languages. NoDaLiDa 2021, pp 1
Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) Squad: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 2383–2392
Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics
Reimers N, Gurevych I (2020) Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics
Richardson M, Burges CJC, Renshaw E (2013) Mctest: a challenge dataset for the open-domain machine comprehension of text. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 193–203
Rogers A, Kovaleva O, Rumshisky A (2020) A primer in bertology: what we know about how bert works. TACL 8:842–866
Seo M, Kembhavi A, Farhadi A, Hajishirzi H (2016) Bidirectional attention flow for machine comprehension. arXiv preprint. arXiv:1611.01603
So BH, Byun K, Kang K, Cho S (2022) Jaquad: Japanese question answering dataset for machine reading comprehension. arXiv preprint. arXiv:2202.01764
Tapeh AG, Rahgozar M (2008) A knowledge-based question answering system for b2c ecommerce. Knowl Based Syst 21(8):946–950
Tran M-V, Le D-T, Tran XT, Nguyen T-T (2012) A model of vietnamese person named entity question answering system. In: Proceedings of PACLIC 2012, pp 325–332
Tran TK (2015) Sentivoice-a system for querying hotel service reviews via phone. In: RIVF 2015, pp 65–70. IEEE
Trotman A, Puurula A, Burgess B (2014) Improvements to bm25 and language models examined. In: Proceedings of the 2014 Australasian Document Computing Symposium, pp 58–65
Van HT, Van Nguyen K, Nguyen NL-T (2022) Vinli: a vietnamese corpus for studies on open-domain natural language inference. In: Proceedings of the 29th International Conference on Computational Linguistics, pp 3858–3872
Van Nguyen K, Nguyen ND, Do PN-T, Nguyen AG-T, Nguyen NL-T (2021) Vireader: a wikipedia-based vietnamese reading comprehension system using transfer learning. J Intell Fuzzy Syst 41:1–19
Van Nguyen K, Tran KV, Luu ST, Nguyen AG-T, Nguyen NL-T (2020) Enhancing lexical-based approach with external knowledge for vietnamese multiple-choice machine reading comprehension. IEEE Access 8:201404–201417
Van Nguyen K, Van Huynh T, Nguyen D-V, Nguyen AG-T, Nguyen NL-T (2022) New vietnamese corpus for machine reading comprehension of health news articles. Trans Asian Low-Resour Lang Inform Process 21(5):1–28
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Voorhees EM et al (1999) The trec-8 question answering track report. In: TREC, vol 99, pp 77–82
Wang H, Yu D, Sun K, Chen J, Yu D, McAllester D, Roth D (2019) Evidence sentence extraction for machine reading comprehension. Proc CoNLL 2019:696–707
Wang S, Yu M, Guo X, Wang Z, Klinger T, Zhang W, Chang S, Tesauro G, Zhou B, Jiang J (2018) R3: reinforced ranker-reader for open-domain question answering. In: AAAI 2018
Wang Z, Ng P, Ma X, Nallapati R, Xiang B (2019) Multi-passage bert: a globally normalized bert model for open-domain question answering. Proc EMNLP-IJCNLP 2019:5878–5882
Woods WA (1973) Progress in natural language understanding: an application to lunar geology. In: Proceedings of the June 4-8, 1973, national computer conference and exposition, pp 441–450
Wu B, Zhang H, Li MY, Wang Z, Feng Q, Huang J, Wang B (2020) Towards non-task-specific distillation of bert via sentence representation approximation. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp 70–79
Xiong W, Li X, Iyer S, Du J, Lewis P, Wang WY, Mehdad Y, Yih S, Riedel S, Kiela D, et al. (2020) Answering complex open-domain questions with multi-hop dense retrieval. In: ICML 2020
Yang W, Xie Y, Lin A, Li X, Tan L, Xiong K, Li M, Lin J (2019) End-to-end open-domain question answering with bertserini. Proc NAACL 2019:72–77
Yang Z, Qi P, Zhang S, Bengio Y, Cohen W, Salakhutdinov R, Manning CD (2018) Hotpotqa: a dataset for diverse, explainable multi-hop question answering. In: Proceedings of EMNLP 2018, pp 2369–2380
Zhang Z, Zhao H, Wang R (2020) Machine reading comprehension: the role of contextualized language models and beyond. Computat Ling 1(1)
Zhao T, Lu X, Lee K (2021) Sparta: efficient open-domain question answering via sparse transformer matching retrieval. Proc NAACL 2021:565–575
Zhu F, Lei W, Wang C, Zheng J, Poria S, Chua T-S (2021) Retrieving and reading: a comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774