Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers

Journal of Medical Systems - Volume 42 - Pages 1-17 - 2018
Md. Maniruzzaman1,2, Md. Jahanur Rahman1, Md. Al-Mehedi Hasan3, Harman S. Suri4, Md. Menhazul Abedin5, Ayman El-Baz6, Jasjit S. Suri7,8
1Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh
2The JiVitA Project of Johns Hopkins University, Gaibandha, Bangladesh
3Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
4Brown University, Providence, USA
5Statistics Discipline, Khulna University, Khulna, Bangladesh
6Department of Bioengineering, University of Louisville, Louisville, USA
7Stroke Monitoring and Diagnostic Division, AtheroPoint™ LLC, Roseville, USA
8Knowledge Engineering Center, Global Biomedical Technologies, Roseville, USA

Abstract

Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world's population was diabetic in 2017, and this share is projected to reach nearly 10% by 2045. The major challenge is that machine learning-based classifiers perform worse when they are applied for risk stratification to data sets that contain missing values or outliers. Our objective is therefore to develop an optimized and robust machine learning (ML) system under the hypothesis that replacing missing values and outliers with computed medians will yield higher risk stratification accuracy. The ML-based risk stratification system is designed, optimized, and evaluated as follows: features are selected and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, AdaBoost, logistic regression, decision tree, and random forest). The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results show that replacing missing values with group medians and outliers with medians, and then combining random forest (RF) feature selection with random forest classification, yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability. The RF-based model showed the best performance when outliers were replaced by median values.
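
The median-replacement step described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation, only a minimal example under stated assumptions: missing values are encoded as None, the "group" used for the group median is the class label (diabetic or control), and outliers are flagged with the common 1.5 × IQR rule, which the abstract does not specify. The function names and the toy readings are hypothetical and are not taken from the Pima Indian dataset.

```python
from statistics import median, quantiles


def impute_missing_with_group_median(values, groups):
    """Replace None entries with the median of the observed values in the same group."""
    observed = {}
    for value, group in zip(values, groups):
        if value is not None:
            observed.setdefault(group, []).append(value)
    # Assumption: every group has at least one observed value.
    group_median = {group: median(vals) for group, vals in observed.items()}
    return [group_median[g] if v is None else v for v, g in zip(values, groups)]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the overall median.

    The IQR rule is an assumed outlier criterion, chosen for illustration.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    overall_median = median(values)
    return [overall_median if (v < low or v > high) else v for v in values]


# Toy glucose-like readings (made-up numbers for illustration only).
glucose = [148, None, 183, 89, 137, None, 78, 115, 600, 125]
label = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]  # 1 = diabetic, 0 = control

imputed = impute_missing_with_group_median(glucose, label)
cleaned = replace_outliers_with_median(imputed)
print(imputed)
print(cleaned)
```

In this sketch the two missing readings are filled with the median of their own class, and the single extreme reading is then replaced by the median of the whole column. The cleaned column could afterwards be passed to any feature selection and classification pipeline, such as the random forest combination reported above.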
