A review: preprocessing techniques and data augmentation for sentiment analysis

Springer Science and Business Media LLC - Volume 8, Issue 1 - 2021

Huu-Thanh Duong¹, Tram-Anh Nguyen-Thi²

¹Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam

²Department of Fundamental Studies, Ho Chi Minh City Open University, 97 Vo Van Tan, Ward 6, District 3, Ho Chi Minh City, Vietnam

Abstract

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires sufficiently large pre-labeled datasets in each target domain. Such datasets are tedious, expensive, and time-consuming to build, and the resulting models struggle with unseen data. This paper takes a semi-supervised approach to Vietnamese sentiment analysis, for which labeled datasets are limited. We summarize preprocessing techniques used to clean and normalize data, including negation handling and intensification handling, to improve performance. We also present data augmentation techniques, which generate new data from the original data to enrich the training set without user intervention. In our experiments, we evaluate these techniques from several aspects and obtain competitive results that may motivate future work.
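To make the idea of generating new training data from existing data concrete, here is a minimal Python sketch of two simple token-level operations, random swap and random deletion. This record does not say which augmentation operations the paper uses, so the operations, the function names (`random_swap`, `random_deletion`, `augment`), the default parameters, and the whitespace tokenization are all illustrative assumptions. Whitespace splitting in particular is a simplification for Vietnamese, where a word may span several syllables.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_copies=2, p_delete=0.1, seed=0):
    """Produce n_copies perturbed variants of a whitespace-tokenized sentence.

    Hypothetical helper: each variant is meant to inherit the original
    sentence's sentiment label, which is an assumption, not a guarantee.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    tokens = sentence.split()
    variants = []
    for _ in range(n_copies):
        swapped = random_swap(tokens, 1, rng)
        variants.append(" ".join(random_deletion(swapped, p_delete, rng)))
    return variants
```

Calling `augment("this phone is very good", n_copies=3)` returns three perturbed copies that can be added to the training set under the original label. Swapping or deleting a negation or intensifier word can change the sentiment, so in practice such words would likely need to be protected from these operations.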


