The Impact of Oversampling with SMOTE on the Performance of 3 Classifiers in Prediction of Type 2 Diabetes

Medical Decision Making - Volume 36, Issue 1 - Pages 137-144 - 2016
Azra Ramezankhani5, Omid Pournik1, Jamal Shahrabi4, Fereidoun Azizi3, Farzad Hadaegh5, Davood Khalili2,5
1Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)
2Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran (DK)
3Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (FA)
4Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran (JS)
5Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)

Abstract

Objective. To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS). Methods. Data from 6647 nondiabetic participants, aged 20 years or older with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using SMOTE at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to build the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden’s index were used to evaluate the performance of the classifiers on the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH). Results. Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of all 3 classification methods. NB had the best Youden’s index before and after oversampling. The ROCCH showed that the PNN is suboptimal for any class and cost conditions. Conclusions. When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be considered. The NB and DT were the optimal classifiers for a prediction task in an imbalanced medical database.
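
The oversampling step and Youden's index mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `youden_index`, the neighbour count `k`, and the toy data are assumptions made for illustration. The sketch also assumes the convention of the original SMOTE description, in which an oversampling level of N% adds N/100 synthetic points per original minority point; the study's exact convention may differ.

```python
import random


def smote(minority, percent=100, k=5, seed=0):
    """Return synthetic minority samples by SMOTE-style interpolation.

    Assumed convention: percent=100 adds one synthetic point per original
    minority point, percent=700 adds seven. Needs at least two minority points.
    """
    rng = random.Random(seed)
    per_point = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority points by squared Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(per_point):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def youden_index(tp, fn, tn, fp):
    """Return sensitivity, specificity, and Youden's index J = sens + spec - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, sensitivity + specificity - 1


# Toy data (hypothetical, not TLGS values): 4 minority points, 2 features each.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_points = smote(minority, percent=700, k=3)
print(len(new_points))  # 4 original points * 7 synthetic points each
```

Because every synthetic point lies on a segment between two existing minority points, oversampling adds no information from outside the region the minority class already occupies. That is one way to read the trade-off reported in the results: sensitivity rises while specificity and accuracy fall.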
