Improving email classification through an enhanced data preprocessing approach

Spatial Information Research - Volume 29 - Pages 247-255 - 2021
B. Aruna Kumara1, Mallikarjun M. Kodabagi1, Tanupriya Choudhury2, Jung-Sup Um3
1CSE Department, REVA University, Bangalore, India
2Department of Informatics, School of Computer Science, University of Petroleum and Energy Studies (UPES), Dehradun, India
3Department of Geography, College of Social Sciences, Kyungpook National University, Daegu, South Korea

Abstract

Email has become one of the most widely used forms of communication, leading to an exponential increase in the number of emails received and placing an enormous burden on existing email classification methods. Applying classification directly to raw data can degrade the performance of classification algorithms, so the data needs to be prepared to improve the performance of machine learning classifiers. This paper proposes an enhanced data preprocessing approach for multi-category email classification. The proposed model removes signatures from emails. In addition, special characters and unwanted words are removed through preprocessing steps such as stop-word removal, enhanced stop-word removal, and stemming. The proposed model is evaluated using several classifiers, including Multinomial Naïve Bayes, Linear Support Vector Classifier, Logistic Regression, and Random Forest. The results show that the proposed data preprocessing approach for email classification outperforms the existing approach.
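For illustration only, the sketch below shows what such a preprocessing-plus-classification pipeline might look like in Python with scikit-learn and NLTK. The signature markers, the extra stop-word list standing in for the paper's enhanced stop-word removal, and the toy emails are hypothetical assumptions, not the authors' implementation.

import re
from nltk.stem import PorterStemmer                      # Porter stemmer, no corpus download needed
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

stemmer = PorterStemmer()
# Hypothetical signature markers; the paper does not specify how signatures are detected.
SIGNATURE_RE = re.compile(r"(?si)(--|\bregards\b|\bthanks\b|\bbest wishes\b).*")
# Hypothetical domain-specific list standing in for the "enhanced stop-word removal" step.
EXTRA_STOP_WORDS = {"dear", "hi", "hello", "subject"}

def preprocess(email_text):
    text = SIGNATURE_RE.sub("", email_text)              # drop trailing signature block
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()     # remove special characters and digits
    tokens = [t for t in text.split()
              if t not in ENGLISH_STOP_WORDS and t not in EXTRA_STOP_WORDS and len(t) > 2]
    return " ".join(stemmer.stem(t) for t in tokens)     # stemming

# Tiny hypothetical corpus; a real evaluation would use a labelled multi-category email set.
emails = ["Meeting rescheduled to Monday, see agenda. Regards, Alice",
          "Invoice attached for your last purchase. Thanks, Billing team",
          "Team workshop this Friday, please confirm. -- Bob",
          "Payment reminder: your invoice is overdue."]
labels = ["work", "finance", "work", "finance"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocess(e) for e in emails)
x_new = vectorizer.transform([preprocess("Please settle the attached invoice. Regards, Carol")])

for clf in (MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100)):
    clf.fit(X, labels)                                   # train on the preprocessed corpus
    print(type(clf).__name__, clf.predict(x_new)[0])     # predicted category for the new email

The toy labels only show how the pieces fit together; the paper evaluates the same family of classifiers on a real multi-category email corpus.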

Keywords

#email classification #data preprocessing #machine learning #Multinomial Naïve Bayes #Random Forest
