arHateDetector: detection of hate speech from standard and dialectal Arabic Tweets

Ramzi Khezzar1, Abdelrahman Moursi1, Zaher Al Aghbari1
1Department of Computer Science, University of Sharjah, Street, Sharjah, UAE

Tóm tắt

AbstractHate speech has become a phenomenon on social media platforms, such as Twitter. These websites and apps that were initially designed to facilitate our expression of free speech, are sometimes being used to spread hate towards each other. In the Arab region, Twitter is a very popular social media platform and thus the number of tweets that contain hate speech is increasing rapidly. Many tweets are written either in standard, dialectal Arabic, or mix. Existing work on Arabic hate speech are targeted towards either standard or single dialectal text, but not both. To fight hate speech more efficiently, in this paper, we conducted extensive experiments to investigate Arabic hate speech in tweets. Therefore, we propose a framework, called arHateDetector, that detects hate speech in the Arabic text of tweets. The proposed arHateDetector supports both standard and several dialectal Arabic. A large Arabic hate speech dataset, called arHateDataset, was compiled from several Arabic standard and dialectal tweets. The tweets are preprocessed to remove the unwanted content. We investigated the use of recent machine learning and deep learning models such as AraBERT to detect hate speech. All classification models used in the investigation are trained with the compiled dataset. Our experiments shows that AraBERT outperformed the other models producing the best performance across seven different datasets including the compiled arHateDataset with an accuracy of 93%. CNN and LinearSVC produced 88% and 89% respectively.

Từ khóa


Tài liệu tham khảo

Saeed MM, Al Aghbari Z. Artc: feature selection using association rules for text classification. Neural Comput Appl. 2022;34(24):22519–29.

Cambridge-Dictionary https://dictionary.cambridge.org/us/dictionary/english/hate-speech.

Statista-Inc: The Most Common Languages on the Internet, https://www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet. 2019.

Elzobi M, Al-Hamadi A, Al Aghbari Z, Dings L, Saeed A. Gabor wavelet recognition approach for off-line handwritten arabic using explicit segmentation. In: S. Choras, R. (ed.) Image Processing and Communications Challenges. Springer, Heidelberg 2014; pp. 245–254.

Dinges L, Al-Hamadi A, Elzobi M, Al Aghbari Z, Mustafa H. Offline automatic segmentation based recognition of handwritten arabic words. Int J Sign Process Image Processing Pattern Recogn. 2011;4(4):131–43.

Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R. Predicting the type and target of offensive posts in social media. Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), p. 1415-1420. 2019.

Davidson T, Warmsley D, Macy M, Weber I. Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 11; 2017. p. 512–5.

Mulki H, Haddad H, Ali CB, Alshabani H. L-hsab: A levantine twitter dataset for hate speech and abusive language. In: Proceedings of the Third Workshop on Abusive Language Online. 2019. p. 111–8.

Mubarak H, Rashed A, Darwish K, Samih Y, Abdelali A. Arabic offensive language on Twitter: Analysis and experiments. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop, pp. 126–135. Association for Computational Linguistics, Kyiv, Ukraine (Virtual). 2021.

Haddad H, Mulki H, Oueslati A. T-hsab: A tunisian hate speech and abusive dataset. In: International Conference on Arabic Language Processing, Springer. 2019; p. 251–63.

Boulouard Z, Ouaissa M, Ouaissa M. Machine learning for hate speech detection in arabic social media. In: Computational Intelligence in Recent Communication Networks. Springer, New York. 2022. p. 147–62.

Albadi N, Kurdi M, Mishra S. Are they our brothers? analysis and detection of religious hate speech in the arabic twittersphere. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). 2018; p. 69–76.

Chowdhury AG, Didolkar A, Sawhney R, Shah R. Arhnet-leveraging community interaction for detection of religious hate speech in arabic. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019. p. 273–80.

Alsafari S, Sadaoui S, Mouhoub M. Hate and offensive speech detection on arabic social media. Online Soc Netw Media. 2020;19: 100096.

Anezi FYA. Arabic hate speech detection using deep recurrent neural networks. Appl Sci. 2022;12(12):6010.

Aldjanabi W, Dahou A, Al-qaness MA, Elaziz MA, Helmi AM, Damaševičius R. Arabic offensive and hate speech detection using a cross-corpora multi-task learning model. Informatics. 2021;8:69.

Husain F, Uzuner O. Investigating the effect of preprocessing arabic text on offensive language and hate speech detection. Trans Asian Low-Resource Language Inform Process. 2022;21(4):1–20.

Alsafari S, Sadaoui S. Semi-supervised self-learning for arabic hate speech detection. In: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2021. p. 863–8.

Mostafa A, Mohamed O, Ashraf A. Gof at arabic hate speech 2022: breaking the loss function convention for data-imbalanced arabic offensive text detection. In: Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection, 2022. p. 167–75.

Mursi KT, Alahmadi MD, Alsubaei FS, Alghamdi AS. Detecting islamic radicalism arabic tweets using natural language processing. IEEE Access. 2022;10:72526–34.

Omar A, Mahmoud TM, Abd-El-Hafeez T, Mahfouz A. Multi-label arabic text classification in online social networks. Inform Syst. 2021;100: 101785.

AbdelHamid M, Jafar A, Rahal Y. Levantine hate speech detection in twitter. Soc Netw Anal Mining. 2022;12(1):1–13.

Bennessir MA, Rhouma M, Haddad H, Fourati C. icompass at arabic hate speech 2022: Detect hate speech using qrnn and transformers. In: Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection, pp. 176–180; 2022.

Dataset: Arabic Levantine Hate Speech. https://dictionary.cambridge.org/us/dictionary/english/hate-speech.

Dataset: Hate Speech Detection in Arabic Twittersphere. https://github.com/raghadsh/Arabic-Hate-speech

Dataset: Religious Hate Speech Detection for Arabic Tweets. https://github.com/nuhaalbadi/Arabic_hatespeech

Dataset: Hate and Offensive Speech Detection on Arabic Social Media. https://github.com/sbalsefri/ArabicHateSpeechDataset.

Dataset: AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News Hate Speech Detection. https://github.com/MohamedHadjAmeur/AraCOVID19MFH.

Dataset: Multi-lingual Hate Speech. https://www.kaggle.com/datasets/wajidhassanmoosa/multilingual-hatespeech-dataset?resource=download.

Ousidhoum N, Lin Z, Zhang H, Song Y, Yeung D-Y. Multilingual and multi-aspect hate speech analysis. arXiv preprint arXiv:1908.11049. 2019.

Alshalan R, Al-Khalifa H. A deep learning approach for automatic hate speech detection in the saudi twittersphere. Appl Sci. 2020;10(23):8614.

Stop-Words: List of Arabic Stop Words on Github. https://github.com/nuhaalbadi/Arabic_hatespeech/blob/master/stop_words.csv.

El Mahdaouy A, El Alaoui SO, Gaussier E. Word-embedding-based pseudo-relevance feedback for arabic information retrieval. J inform Sci. 2019;45(4):429–42.

Kim Y. Convolutional neural networks for sentence classification. CoRR arXiv:abs/1408.5882. 2014.

Alkouz B, Al Aghbari Z, Al-Garadi MA, Sarker A. Deepluenza: Deep learning for influenza detection from twitter. Expert Syst Appl. 2022;198: 116845.

Antoun W, Baly F, Hajj H. Arabert: Transformer-based model for arabic language understanding. arXiv preprint arXiv:2003.00104. 2020.

NumPy: The Fundamental Package for Scientific Computing with Python. https://numpy.org/

NLTK: Natural Language Toolkit. https://www.nltk.org/.

scikit-learn: Tools for Predictive Data Analysis. https://scikit-learn.org/stable/.

TensorFlow: Open Source Platform for Machine Learning. https://www.tensorflow.org/overview.

Keras: Deep Learning API Written in Python. https://keras.io/api/.

AraBERT: Arabic Pretrained Language Model Based on Google’s BERT. https://github.com/aub-mind/arabert#AraBERT.