FLICs (Facebook Language Informal Corpus): a novel dataset for informal language

Francis Rakotomalala¹, Aimé Richard Hajalalaina¹, Manda Vy Ravonimanantsoa Ndaohialy², Anselme Andriavelonera Alexandre¹, Andriatina H. Ranaivoson³
¹ ICT and Informatics for Development, University of Fianarantsoa, Fianarantsoa, Madagascar
² Telecommunications, Automation, Signal and Image, University of Antananarivo, Antananarivo, Madagascar
³ Mathematics and Structures, University of Antananarivo, Antananarivo, Madagascar

Abstract

This article introduces FLICs, a novel dataset designed for modeling informal language. Composed predominantly of Malagasy and French text, FLICs was collected from Facebook and cleaned using Python scraping techniques. It aims to address the lack of linguistic diversity in existing datasets by offering over 800,000 informal texts that reflect current trends in informal communication. FLICs stands out for its inclusion of features characteristic of informal communication, such as abbreviations, dialects, slang, emoticons, and keywords. It also highlights Malagasy, a less-studied language that is used here in an extremely informal manner and interwoven with French through code-switching. This linguistic variety opens new avenues for NLP research, particularly in text comprehension, generation, and classification. To validate the dataset, we used FastText, a word embedding model, to capture word semantics within the corpus. The analyses indicated that the data are relevant: pre-trained FastText word embeddings faithfully captured semantic relationships between words, and the contexts of informal words were explored by clustering the word vectors using K-means and PCA. Validation was then extended by using the pre-trained FastText embeddings for text generation with an LSTM model. This experiment confirmed that the embeddings help generate consistent and appropriate informal text. In summary, these contributions enrich the resources available for natural language processing research on informal language.
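
The clustering step mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: the six toy vectors are made-up stand-ins for FastText embeddings, the helper names (`kmeans`, `pca_2d`) are hypothetical, and the K-means initialisation and the power-iteration PCA are simplifications chosen to keep the example short.

```python
import math

# Made-up 4-dimensional vectors standing in for FastText embeddings (assumption).
vectors = {
    "lol":       [0.9, 0.8, 0.1, 0.0],
    "mdr":       [1.0, 0.7, 0.0, 0.1],
    "ptdr":      [0.8, 0.9, 0.1, 0.1],
    "salama":    [0.1, 0.0, 0.9, 0.8],
    "manahoana": [0.0, 0.1, 1.0, 0.7],
    "bonjour":   [0.1, 0.1, 0.8, 0.9],
}
words = list(vectors)
points = list(vectors.values())


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans(rows, k, max_iters=50):
    """Lloyd's K-means with deterministic farthest-point initialisation."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


def pca_2d(rows, power_iters=200):
    """Project rows onto their first two principal components.

    Uses power iteration with deflation on the sample covariance matrix.
    """
    mu = mean(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    d, n = len(mu), len(rows)
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    components = []
    for _component in range(2):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed, non-degenerate start
        for _step in range(power_iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(
            v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
        )
        components.append(v)
        # Deflate so the next pass finds the next-largest component.
        cov = [
            [cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)
        ]
    return [
        [sum(x * c for x, c in zip(r, comp)) for comp in components]
        for r in centered
    ]


labels = kmeans(points, k=2)
coords = pca_2d(points)
for word, label, (x, y) in zip(words, labels, coords):
    print(f"{word:10s} cluster={label}  pc1={x:+.3f}  pc2={y:+.3f}")
```

In practice the vectors would come from a trained embedding model rather than a hand-written dictionary, and a library implementation of K-means and PCA would normally replace these helpers. The sketch only shows how cluster labels and a 2-D projection can be read side by side to inspect which informal words share a context.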
