Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project

PLoS ONE - Volume 12, Issue 7 - Page e0179805
Manal Alghamdi1,2, Mouaz H. Al‐Mallah3,1,2, Steven J. Keteyian3, Clinton A. Brawner3, Jonathan K. Ehrman3, Sherif Sakr1,2
1King Abdullah International Medical Research Center, Riyadh, Saudi Arabia
2King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
3Heart and Vascular Institute, Henry Ford Hospital System, Detroit, MI, United States of America

Abstract

Keywords


References

International Diabetes Federation, http://www.diabetesatlas.org.

L Rydén, 2007, Guidelines on diabetes, pre-diabetes, and cardiovascular diseases: full text, European Heart Journal Supplements, 9, C3, 10.1093/eurheartj/ehl261

SP Juraschek, 2015, Cardiorespiratory fitness and incident diabetes: the FIT (Henry Ford ExercIse Testing) project, Diabetes Care, 38, 1075, 10.2337/dc14-2714

S Habibi, 2015, Type 2 Diabetes Mellitus Screening and Risk Factors Using Decision Tree: Results of Data Mining, Global journal of health science, 7, 304, 10.5539/gjhs.v7n5p304

M Zhu, 2015, Mortality rates and the causes of death related to diabetes mellitus in Shanghai Songjiang District: an 11-year retrospective analysis of death certificates, BMC endocrine disorders, 15, 45, 10.1186/s12902-015-0042-1

S Leahy, 2015, Prevalence and correlates of diagnosed and undiagnosed type 2 diabetes mellitus and pre-diabetes in older adults: Findings from the Irish Longitudinal Study on Ageing (TILDA), Diabetes research and clinical practice, 110, 241, 10.1016/j.diabres.2015.10.015

L Alhyas, 2012, Prevalence of type 2 diabetes in the States of the co-operation council for the Arab States of the Gulf: a systematic review, PloS one, 7, e40948, 10.1371/journal.pone.0040948

PT Williams, 2008, Vigorous exercise, fitness and incident hypertension, high cholesterol, and diabetes, Medicine and science in sports and exercise, 40, 998, 10.1249/MSS.0b013e31816722a9

S Wild, 2004, Global prevalence of diabetes estimates for the year 2000 and projections for 2030, Diabetes care, 27, 1047, 10.2337/diacare.27.5.1047

D Statistics, 1999, National Institute of Diabetes and Digestive and Kidney Diseases, 99

I Kononenko, 2001, Machine learning for medical diagnosis: history, state of the art and perspective, Artificial Intelligence in medicine, 23, 89, 10.1016/S0933-3657(01)00077-X

CC Aggarwal, 2014, Data classification: algorithms and applications, 10.1201/b17320

MH Al-Mallah, 2014, Rationale and design of the Henry Ford Exercise Testing Project (the FIT project), Clinical cardiology, 37, 456, 10.1002/clc.22302

AL Blum, 1997, Selection of relevant features and examples in machine learning, Artificial intelligence, 97, 245, 10.1016/S0004-3702(97)00063-5

I Guyon, 2003, An introduction to variable and feature selection, Journal of machine learning research, 3, 1157

JT Kent, 1983, Information gain and a general measure of correlation, Biometrika, 70, 163, 10.1093/biomet/70.1.163

SB Kotsiantis, 2007, Supervised machine learning: A review of classification techniques

XH Meng, 2013, Comparison of three data mining models for predicting diabetes or prediabetes by risk factors, The Kaohsiung journal of medical sciences, 29, 93, 10.1016/j.kjms.2012.08.016

SE Stern, 2005, Identification of individuals with insulin resistance using routine clinical measurements, Diabetes, 54, 333, 10.2337/diabetes.54.2.333

JL Breault, 2002, Data mining a diabetic data warehouse, Artificial intelligence in medicine, 26, 37, 10.1016/S0933-3657(02)00051-9

JR Quinlan, 2014, C4.5: programs for machine learning

R Kohavi, 1996, KDD, vol. 96, 202

S Le Cessie, 1992, Ridge estimators in logistic regression, Applied statistics, 191, 10.2307/2347628

John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–345.

N Landwehr, 2005, Logistic model trees, Machine Learning, 59, 161, 10.1007/s10994-005-0466-3

Sumner M, Frank E, Hall M. Speeding up logistic model tree induction. In: European Conference on Principles of Data Mining and Knowledge Discovery. Springer; 2005. p. 675–683.

A Liaw, 2002, Classification and regression by randomForest, R news, 2, 18

L Breiman, 2001, Random forests, Machine learning, 45, 5, 10.1023/A:1010933404324

GE Batista, 2004, A study of the behavior of several methods for balancing machine learning training data, ACM Sigkdd Explorations Newsletter, 6, 20, 10.1145/1007730.1007735

G Menardi, 2014, Training and assessing classification rules with imbalanced data, Data Mining and Knowledge Discovery, 28, 92, 10.1007/s10618-012-0295-5

V Ganganwar, 2012, An overview of classification algorithms for imbalanced datasets, International Journal of Emerging Technology and Advanced Engineering, 2, 42

H He, 2009, Learning from imbalanced data, IEEE Transactions on knowledge and data engineering, 21, 1263, 10.1109/TKDE.2008.239

Poolsawad N, Kambhampati C, Cleland J. Balancing class for performance of classification with a clinical dataset. In: Proceedings of the World Congress on Engineering. vol. 1; 2014.

Wang J, Xu M, Wang H, Zhang J. Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In: 2006 8th international Conference on Signal Processing. vol. 3. IEEE; 2006.

García V, Alejo R, Sánchez JS, Sotoca JM, Mollineda RA. Combined effects of class imbalance and class overlap on instance-based classification. In: International Conference on Intelligent Data Engineering and Automated Learning. Springer; 2006. p. 371–378.

CR Jack, 2008, The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods, Journal of Magnetic Resonance Imaging, 27, 685, 10.1002/jmri.21049

L Lusa, 2015, Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, BMC bioinformatics, 16, 1

NV Chawla, 2005, Data mining and knowledge discovery handbook, 853

P Refaeilzadeh, 2009, Encyclopedia of database systems, 532

JH Kim, 2009, Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53, 3735, 10.1016/j.csda.2009.04.009

R Kohavi, 1995, IJCAI, vol. 14, 1137

Y Bengio, 2004, No unbiased estimator of the variance of k-fold cross-validation, Journal of Machine Learning Research, 5, 1089

B Liu, 2015, Identification of real microRNA precursors with a pseudo structure status composition approach, PloS one, 10, e0121501, 10.1371/journal.pone.0121501

B Liu, 2016, iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach, Journal of Biomolecular Structure and Dynamics, 34, 223, 10.1080/07391102.2015.1014422

Y Zhang, 2014, Abstract and Applied Analysis, vol. 2014

B Liu, 2016, Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning, IEEE transactions on nanobioscience, 15, 328, 10.1109/TNB.2016.2555951

B Liu, 2016, iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework, Bioinformatics, 32, 2411, 10.1093/bioinformatics/btw186

B Liu, 2017, iRSpot-EL: identify recombination spots with an ensemble learning approach, Bioinformatics, 33, 35, 10.1093/bioinformatics/btw539

L Song, 2014, nDNA-prot: identification of DNA-binding proteins based on unbalanced classification, BMC bioinformatics, 15, 298, 10.1186/1471-2105-15-298

C Wang, 2015, imDC: an ensemble learning method for imbalanced classification with miRNA data, Genetics and Molecular Research, 14, 123, 10.4238/2015.January.15.15

JR Quinlan, 1986, Induction of decision trees, Machine learning, 1, 81, 10.1007/BF00116251

G Seni, 2010, Ensemble methods in data mining: improving accuracy through combining predictions, Synthesis Lectures on Data Mining and Knowledge Discovery, 2, 1, 10.2200/S00240ED1V01Y200912DMK002

B Farran, 2013, Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait—a cohort study, BMJ open, 3, e002457, 10.1136/bmjopen-2012-002457

D Tomar, 2013, A survey on Data Mining approaches for Healthcare, International Journal of Bio-Science and Bio-Technology, 5, 241, 10.14257/ijbsbt.2013.5.5.25