The Impact of Oversampling with SMOTE on the Performance of 3 Classifiers in Prediction of Type 2 Diabetes

Medical Decision Making - Volume 36, Issue 1 - Pages 137-144 - 2016

Azra Ramezankhani⁵, Omid Pournik¹, Jamal Shahrabi⁴, Fereidoun Azizi³, Farzad Hadaegh⁵, Davood Khalili²,⁵

¹Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)

²Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran (DK)

³Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (FA)

⁴Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran (JS)

⁵Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)

Abstract

Objective. To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS). Methods. Data from 6647 nondiabetic participants, aged 20 years or older with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using SMOTE at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to build the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden’s index were used to evaluate the performance of the classifiers on the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH). Results. Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of all 3 classification methods. NB had the best Youden’s index before and after oversampling. The ROCCH showed that the PNN is suboptimal for any class and cost conditions. Conclusions. When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be considered. The NB and DT were the optimal classifiers for a prediction task in an imbalanced medical database.
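The two computational steps named in the abstract, SMOTE oversampling and the confusion-matrix metrics, can be illustrated with a minimal Python sketch. It is not the authors' implementation (the reference list suggests KNIME was used). The function names `smote` and `metrics`, the neighbour count `k`, the toy data, and the confusion-matrix counts are all hypothetical. The sketch also assumes the common convention that an oversampling rate of N% generates N/100 synthetic points per original minority sample; the abstract does not spell out which convention the study used.

```python
import math
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic minority samples, percent/100 per original point.

    Assumes at least two minority samples and a percent that is a
    multiple of 100 (hypothetical simplifications for this sketch).
    """
    rng = random.Random(seed)
    per_point = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (Euclidean), excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_point):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def metrics(tp, fn, fp, tn):
    """Compute the six measures listed in the abstract from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "youden": sensitivity + specificity - 1,
    }


if __name__ == "__main__":
    # Toy minority class with two features (made-up values)
    minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
    new_points = smote(minority, percent=200, k=3)
    print(len(new_points))  # 2 synthetic points per original sample -> 8
    # Made-up confusion-matrix counts, not results from the study
    print(metrics(tp=40, fn=10, fp=30, tn=120))
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling shifts the class balance seen by the learner without duplicating records. This is consistent with the trade-off the abstract reports: higher sensitivity at the cost of specificity and accuracy.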

