Classification of breast cancer recurrence based on imputed data: a simulation study

BioData Mining - Volume 15 - Pages 1-13 - 2022

Rahibu A. Abassi¹, Amina S. Msengwa²

¹Department of Natural Sciences, State University of Zanzibar, Zanzibar, Tanzania

²Department of Statistics, University of Dar es Salaam, Dar es Salaam, Tanzania

Abstract

Several studies have classified various real-life events, but few have done so in medical fields, particularly for breast cancer recurrence using statistical techniques. To our knowledge, no reported study compares statistical classification accuracy and classifiers' discriminative ability for breast cancer recurrence in the presence of imputed missing data. This article therefore aims to fill this gap by comparing the performance of binary classifiers (logistic regression, linear discriminant analysis and quadratic discriminant analysis) on several datasets resulting from imputation under various simulation conditions. Our study adds to the knowledge of how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation by chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, classification accuracy and area under the Receiver Operating Characteristic (ROC) curve under the MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% missingness under the MCAR mechanism. The logistic regression classifier based on data imputed by predictive mean matching yielded the greatest area under the ROC curve (0.6418) at 30% missingness, while data imputed by k-nearest neighbour gave the top value (0.6428) at 60% missingness under the MCAR mechanism.
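The simulation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it covers only the MCAR mechanism and mean imputation, uses a single synthetic numeric predictor (hypothetical data, not the breast cancer dataset), and scores discriminative ability by using the predictor itself as the classifier score instead of fitting logistic regression or discriminant analysis. All function names are assumptions made for illustration.

```python
import random
from statistics import mean


def ampute_mcar(values, rate, rng):
    """MCAR: each value goes missing (None) independently with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def mean_impute(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def auc(scores, labels):
    """Area under the ROC curve as the share of (positive, negative) pairs
    ranked correctly; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 500
# Hypothetical binary outcome (1 = recurrence) with roughly 30% positives.
labels = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
# Hypothetical numeric predictor, shifted upward in the positive class.
x = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

auc_complete = auc(x, labels)  # reference value on the complete data
results = {}
for rate in (0.15, 0.30, 0.45, 0.60):  # missingness levels from the abstract
    x_imputed = mean_impute(ampute_mcar(x, rate, rng))
    results[rate] = auc(x_imputed, labels)

for rate, value in results.items():
    print(f"missingness {rate:.0%}: AUC {value:.4f} (complete data: {auc_complete:.4f})")
```

Mean imputation gives every imputed case the same score, so imputed cases form ties in the ranking. In this toy setting that tends to pull the area under the curve toward 0.5 as missingness grows. A fuller replication would add the MAR mechanism, the remaining imputation methods and the three fitted classifiers.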

