Azra Ramezankhani5, Omid Pournik1, Jamal Shahrabi4, Fereidoun Azizi3, Farzad Hadaegh5, Davood Khalili2,5
1Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)
2Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran (DK)
3Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (FA)
4Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran (JS)
5Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)
Abstract
Objective. To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS). Methods. Data from 6647 nondiabetic participants aged 20 years or older, with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using SMOTE at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to build the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden’s index were used to evaluate the performance of the classifiers on the test dataset. The ROC convex hull (ROCCH) was used to compare the performance of the 3 classification models. Results. Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of all 3 classification methods. NB had the best Youden’s index both before and after oversampling. The ROCCH showed that the PNN was suboptimal under any class and cost conditions. Conclusions. When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be taken into account. NB and DT were the optimal classifiers for this prediction task in an imbalanced medical database.
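To make the oversampling and evaluation steps concrete, the following is a minimal sketch and not the implementation used in the study. It assumes the usual SMOTE convention in which oversampling at N% creates N/100 synthetic points per original minority sample, each interpolated toward one of its k nearest minority-class neighbours. The function names `smote` and `metrics`, the default `k=5`, the seed, and the toy data are all hypothetical and chosen for illustration only.

```python
import math
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic minority samples by SMOTE-style interpolation.

    Assumption: percent=100 yields one synthetic point per original
    minority sample, percent=700 yields seven.
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x (Euclidean distance)
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(per_sample):
            nn = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from x to nn
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic


def metrics(tp, fp, tn, fn):
    """Compute the six performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "youden": sensitivity + specificity - 1,
    }


# Hypothetical toy data: four minority-class samples with two features each.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
toy_synthetic = smote(toy_minority, percent=700, k=3)

# Hypothetical confusion-matrix counts, not results from the study.
toy_scores = metrics(tp=8, fp=1, tn=9, fn=2)
```

Because each synthetic point lies on a line segment between two existing minority samples, oversampling adds no information from outside the observed minority region. Youden’s index (sensitivity + specificity − 1) weights both error types equally, which makes it less sensitive to class skew than accuracy.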