Feature Selection in Imbalanced Data
Tóm tắt
The traditional feature selection methods are not suitable for imbalanced data as they tend to be biased towards the majority class. This problem is particularly acute in the field of medical diagnostics and fraud detection where the class distribution is highly skewed. In this paper, we propose a novel filter approach using decision tree-based
$$F_1$$
-score. The
$$F_1$$
-score incorporates the accuracy with respect to the minority class data and hence is a good measure in the case of imbalanced data. In the proposed implementation, the
$$F_1$$
-score is calculated based on a 1-dimensional decision tree classifier resulting in a fast and effective feature evaluation method. Numerical experiments confirm that the proposed method achieves robust dimensionality reduction and accuracy results. In addition, the low computational complexity of the algorithm makes it a practical choice for big data applications.
Tài liệu tham khảo
Olson DL, Shi Y, Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin, New York, pp 2250–2254
Shi Y, Tian Y, Kou G, Peng Y, Li J (2011) Optimization based data mining: theory and applications. Springer Science & Business Media, Berlin
Tien JM (2017) Internet of things, real-time decision making, and artificial intelligence. Ann Data Sci 4(2):149–178
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28
Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 5(4):221–232
Thabtah F, Hammoud S, Kamalov F, Gonsalves A (2020) Data imbalance in classification: experimental evaluation. Inf Sci 513:429–441
Maldonado S, Weber R, Famili F (2014) Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines. Inf Sci 286:228–246
Moayedikia A, Ong KL, Boo YL, Yeoh WG, Jensen R (2017) Feature selection for high dimensional imbalanced class data using harmony search. Eng Appl Artif Intell 57:38–49
Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239
Mani I, Zhang I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets, vol 126
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Kamalov F (2020) Kernel density estimation based sampling for imbalanced class distribution. Inf Sci 512:1192–1201
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2013) A review of feature selection methods on synthetic data. Knowl Inf Syst 34(3):483–519
Majeed A (2019) Improving time complexity and accuracy of the machine learning algorithms through selection of highly weighted top k features from complex datasets. Ann Data Sci 6(4):599–621
Kamalov F, Thabtah F (2017) A feature selection method based on ranked vector scores of features for classification. Ann Data Sci 4(4):483–502
Thabtah F, Kamalov F, Rajab K (2018) A new computational intelligence approach to detect autistic features for autism screening. Int J Med Inf 117:112–124
Zheng Z, Wu X, Srihari R (2004) Feature selection for text categorization on imbalanced data. ACM Sigkdd Explor Newslett 6(1):80–89
Yang P, Liu W, Zhou BB, Chawla S, Zomaya AY (2013) Ensemble-based wrapper methods for feature selection and class imbalance learning. Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 544–555
Yijing L, Haixiang G, Xiao L, Yanan L, Jinling L (2016) Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl-Based Syst 94:88–104
Kamalov F (2018) Sensitivity analysis for feature selection. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 1466–1470
Du LM, Xu Y, Zhu H (2015) Feature selection for multi-class imbalanced data sets based on genetic algorithm. Ann Data Sci 2(3):293–300
Thabtah F, Kamalov F (2017) Phishing detection: a case analysis on classifiers with rules using machine learning. J Inf Knowl Manage 16(04):1750034
Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
Lemaitre G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res 18(1):559–563
Guyon I, Gunn S, Hur AB, Dror G (2006) Design and analysis of the NIPS2003 challenge. Feature Extraction. Springer, Berlin, pp 237–263
Dua D, Graff C (2019) UCI machine learning repository [http://archive.ics.uci.edu/ml]. University of California, School of Information and Computer Science, Irvine, CA