Feature Selection in Imbalanced Data

Annals of Data Science - Volume 10 - Pages 1527-1541 - 2022
Firuz Kamalov (1), Fadi Thabtah (2), Ho Hon Leung (3)
(1) Canadian University of Dubai, Dubai, UAE
(2) Manukau Institute of Technology, Manukau, New Zealand
(3) UAE University, Al Ain, UAE

Abstract

Traditional feature selection methods are poorly suited to imbalanced data because they tend to be biased towards the majority class. The problem is particularly acute in fields such as medical diagnostics and fraud detection, where the class distribution is highly skewed. In this paper, we propose a novel filter approach that uses a decision tree-based $F_1$-score. The $F_1$-score, the harmonic mean of precision and recall, reflects accuracy on the minority class and is therefore a good measure for imbalanced data. In the proposed implementation, the $F_1$-score is calculated from a 1-dimensional decision tree classifier, that is, a tree trained on a single feature at a time, which yields a fast and effective feature evaluation method. Numerical experiments confirm that the proposed method achieves robust dimensionality reduction and accuracy. In addition, the low computational complexity of the algorithm makes it a practical choice for big data applications.
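The filter idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies only a 1-dimensional decision tree and the $F_1$-score, so the tree depth, the use of cross-validated predictions, the choice of label 1 as the minority class, the synthetic data, and the function name `f1_feature_scores` are all assumptions made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def f1_feature_scores(X, y, max_depth=3, cv=5, random_state=0):
    """Score each feature by the minority-class F1 of a tree trained on it alone.

    max_depth and cv are illustrative assumptions, not values from the paper.
    """
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
        # Out-of-fold predictions from a classifier that sees only feature j.
        y_pred = cross_val_predict(tree, X[:, [j]], y, cv=cv)
        # Assumption: the minority class is labelled 1.
        scores[j] = f1_score(y, y_pred, pos_label=1, zero_division=0)
    return scores


# Synthetic imbalanced data (about 10% minority class), for illustration only.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    shuffle=False,
    random_state=0,
)

scores = f1_feature_scores(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, best score first
top_k = ranking[:3]                 # keep the three highest-scoring features
```

Because each feature is evaluated independently with a shallow tree, the cost grows linearly with the number of features, which is consistent with the low computational complexity the abstract claims. Cross-validation is used in this sketch only to avoid scoring a tree on the data it was fitted to.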
