Detection of colon cancer based on microarray dataset using machine learning as a feature selection and classification techniques

Springer Science and Business Media LLC - Volume 2 - Pages 1-8 - 2020
A. S. M. Shafi1,2, M. M. Imran Molla2, Julakha Jahan Jui3, Mohammad Motiur Rahman1
1Department of Computer Science and Engineering, Mawlana Bhashani Science and Technology University, Tangail, Bangladesh
2Faculty of Computer Science and Engineering, Khwaja Yunus Ali University, Enayetpur, Sirajgonj, Bangladesh
3Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang, Pekan, Malaysia

Abstract

Microarray data are an increasingly important source of gene expression information for analysis and interpretation. In most gene expression studies, researchers try to use the smallest possible set of relevant gene expression profiles to improve tumor identification accuracy. This research aims to analyze and predict colon cancer data using a machine learning approach and a feature selection technique based on a random forest classifier. More specifically, the proposed method reduces the burden of high-dimensional data and allows faster computation by combining “Mean Decrease Accuracy” and “Mean Decrease Gini” as feature selection methods within the well-known Random Forest classifier, with the aim of increasing the prediction model's accuracy. In addition, we compare the model with feature selection against the model without feature selection. Extensive experimental results demonstrate that the proposed model with feature selection is effective and achieves the best accuracy.
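
As a rough illustration of the idea in the abstract, the following Python sketch computes the two importance measures on a tiny forest of one-split trees (stumps) and keeps the features that score highly under both. It is not the authors' implementation. The synthetic data, the function names (`toy_data`, `forest_importances`, `select_features`), the use of stumps instead of full trees, and the "top k under both measures" combination rule are all assumptions made for illustration; the abstract does not say how the two rankings are combined.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(X, y, rows, feats):
    """Best single split over the candidate features.
    Returns (gini decrease, feature, threshold, left label, right label)."""
    parent = gini([y[i] for i in rows])
    best = (0.0, None, None, 0, 0)
    for f in feats:
        for t in sorted({X[i][f] for i in rows}):
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            w = len(left) / len(rows)
            gain = parent - w * gini(left) - (1 - w) * gini(right)
            if gain > best[0]:
                lo = int(2 * sum(left) > len(left))    # majority class on the left
                hi = int(2 * sum(right) > len(right))  # majority class on the right
                best = (gain, f, t, lo, hi)
    return best

def forest_importances(X, y, n_trees=100, mtry=3, seed=0):
    """Mean Decrease Accuracy and Mean Decrease Gini per feature (stump forest)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mda, mdg = [0.0] * p, [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]  # out-of-bag rows
        feats = rng.sample(range(p), mtry)            # random feature subset
        gain, f, t, lo, hi = best_stump(X, y, boot, feats)
        if f is None or not oob:
            continue
        mdg[f] += gain / n_trees
        # Out-of-bag accuracy before and after permuting the split feature.
        acc = sum((lo if X[i][f] <= t else hi) == y[i] for i in oob) / len(oob)
        shuffled = [X[i][f] for i in oob]
        rng.shuffle(shuffled)
        perm = sum((lo if v <= t else hi) == y[i]
                   for v, i in zip(shuffled, oob)) / len(oob)
        mda[f] += (acc - perm) / n_trees
    return mda, mdg

def select_features(mda, mdg, k):
    """Assumed rule: keep features ranked in the top k by both measures."""
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])
    return sorted(top(mda) & top(mdg))

def toy_data(n=60, p=8, seed=1):
    """Synthetic stand-in for expression data: feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    for i in range(n):
        X[i][0] += 2.0 * y[i]
    return X, y

X, y = toy_data()
mda, mdg = forest_importances(X, y)
selected = select_features(mda, mdg, k=2)
print("selected feature indices:", selected)
```

On real microarray data the same pattern would apply to a full random forest: rank genes by each measure, retain a small subset, then retrain the classifier on the retained genes and compare its accuracy with the model trained on all genes.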
