Detection of colon cancer based on microarray dataset using machine learning as a feature selection and classification techniques
Abstract
Microarray data are an increasingly important source of gene expression information for analysis and interpretation. Most gene expression studies try to use the smallest possible set of relevant gene expression profiles to improve tumor identification accuracy. This research aims to analyze and predict colon cancer data using a machine learning approach with feature selection based on a random forest classifier. More specifically, the proposed method reduces the burden of high-dimensional data and speeds up computation by combining “Mean Decrease Accuracy” and “Mean Decrease Gini” as feature selection measures within the well-known Random Forest classifier, with the aim of increasing the prediction model's accuracy. In addition, we compare the model with feature selection against the model without feature selection. Extensive experimental results show that the proposed model with feature selection is effective and achieves the best accuracy.
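To make the two importance measures concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It makes several simplifying assumptions: each "tree" is a one-level stump, the data are a synthetic toy set rather than the colon microarray data, and the two rankings are combined by keeping features that appear in the top k of both, which is only one possible combination rule. All function names (`fit_stump`, `forest_importance`, `select_features`, `make_toy_data`) are hypothetical.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def majority(labels):
    """Most common 0/1 label (ties and empty lists resolve to 1)."""
    return int(2 * sum(labels) >= len(labels))

def fit_stump(X, y, features):
    """Best one-level split among `features`, or None if no split exists.

    Returns (feature, threshold, left_label, right_label, gini_decrease).
    """
    n, parent, best = len(y), gini(y), None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            drop = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or drop > best[4]:
                best = (f, t, majority(left), majority(right), drop)
    return best

def accuracy(stump, X, y, rows, column=None):
    """Accuracy on `rows`; `column` optionally replaces the split feature's values."""
    f, t, left_label, right_label, _ = stump
    hits = 0
    for k, i in enumerate(rows):
        value = X[i][f] if column is None else column[k]
        hits += (left_label if value <= t else right_label) == y[i]
    return hits / len(rows)

def forest_importance(X, y, n_trees=100, seed=0):
    """Return (mdg, mda): mean decrease in Gini and in out-of-bag accuracy."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = max(1, int(math.sqrt(p)))  # candidate features per stump
    mdg, mda = [0.0] * p, [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]  # out-of-bag rows
        stump = fit_stump([X[i] for i in boot], [y[i] for i in boot],
                          rng.sample(range(p), mtry))
        if stump is None or not oob:
            continue
        f = stump[0]
        mdg[f] += stump[4] / n_trees
        # A stump uses one feature, so only permuting that feature can change
        # its out-of-bag accuracy; every other feature contributes zero here.
        shuffled = [X[i][f] for i in oob]
        rng.shuffle(shuffled)
        drop = accuracy(stump, X, y, oob) - accuracy(stump, X, y, oob, shuffled)
        mda[f] += drop / n_trees
    return mdg, mda

def select_features(mdg, mda, k):
    """Keep features in the top k of both rankings (assumed combination rule)."""
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])
    return sorted(top(mdg) & top(mda))

def make_toy_data(n=60, p=6, seed=1):
    """Synthetic data: feature 0 separates the classes, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[2.0 * y[i] + rng.random()] + [rng.random() for _ in range(p - 1)]
         for i in range(n)]
    return X, y

X, y = make_toy_data()
mdg, mda = forest_importance(X, y)
selected = select_features(mdg, mda, k=2)
print("Selected feature indices:", selected)
```

On this toy set the informative feature (index 0) should rank first under both measures. In practice a full random forest library would replace the stump forest, and a classifier would then be retrained on the selected genes only and compared against one trained on all genes.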