A multilingual offensive language detection method based on transfer learning from transformer fine-tuning model

Fatima-zahra El-Alami1, Said Ouatik El Alaoui1,2, Noureddine En Nahnahi1
1Laboratory of Informatics, Signals, Automatic and Cognitivism, FSDM, Sidi Mohamed Ben Abdellah University, Fez, Morocco
2Engineering Sciences Laboratory, National School of Applied Sciences, Ibn Tofail University, Kenitra, Morocco
