A review: preprocessing techniques and data augmentation for sentiment analysis

Huu-Thanh Duong1, Tram-Anh Nguyen-Thi2
1Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
2Department of Fundamental Studies, Ho Chi Minh City Open University, 97 Vo Van Tan, Ward 6, District 3, Ho Chi Minh City, Vietnam

Abstract

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires sufficiently large pre-labeled datasets for each domain. Building such datasets is tedious, expensive, and time-consuming, and the resulting models struggle with unseen data. This paper takes a semi-supervised learning approach to Vietnamese sentiment analysis, for which only limited datasets are available. We summarize preprocessing techniques used to clean and normalize the data, including negation handling and intensification handling, to improve performance. We also present data augmentation techniques, which generate new data from the original data to enrich the training set without user intervention. In our experiments, we evaluate various aspects of these techniques and obtain competitive results that may motivate further work.
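To make the two ideas in the abstract concrete, the following is a minimal sketch of negation handling and of two simple augmentation operations (random swap and random deletion). It is an illustration under stated assumptions, not the authors' implementation: the negator list, the `NOT_` prefix convention, and the function names `mark_negation`, `random_swap`, and `random_deletion` are all hypothetical choices made for this example.

```python
import random

# Assumed, illustrative list of Vietnamese negation words; the lexicon
# actually used in the paper is not given in this abstract.
NEGATORS = {"không", "chẳng", "chưa"}


def mark_negation(tokens):
    """Attach a 'NOT_' prefix to the token that follows a negator.

    The negator itself is dropped; a trailing negator with nothing
    after it is dropped as well.
    """
    out, negate = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            negate = True
            continue
        out.append("NOT_" + tok if negate else tok)
        negate = False
    return out


def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


rng = random.Random(0)                      # fixed seed for repeatability
tokens = "phim này không hay".split()       # "this film is not good"
print(mark_negation(tokens))                # ['phim', 'này', 'NOT_hay']
print(random_swap(tokens, rng))             # same words, two positions swapped
print(random_deletion(tokens, rng, p=0.3))  # a random subset of the words
```

Marking the negated word as a distinct token lets a bag-of-words classifier treat "hay" and "NOT_hay" as separate features. The augmentation functions produce new training samples that keep the original label, with no manual annotation involved.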
