Improved gene expression diagnosis via cascade entropy-fisher score and ensemble classifiers

Bolourchi, Pouya
Electrical and Electronic Engineering, Final International University, Girne, Turkey

Abstract

Feature selection is an important technique in bioinformatics modeling for reducing the dimensionality of high-dimensional data. However, filter-based approaches, although they have shown better performance, often depend on a specific measurement method, which can limit their effectiveness. To address this problem, this paper proposes a novel cascade feature selection approach, named the cascade entropy-fisher score (CEFS), that combines entropy score (ES)-based and Fisher score (FS)-based feature selection. CEFS is a two-step process. In the first step, the entropy of each gene in the dataset is calculated to measure the uncertainty associated with its expression levels across samples. In the second step, the Fisher score is computed to measure the extent to which the gene's expression levels differ between classes of samples. CEFS has been shown to outperform other methods in identifying disease-specific genes in gene expression datasets, making it a promising tool for disease diagnosis and prognosis. The proposed method was evaluated on biomedical datasets, and its effectiveness was measured in terms of accuracy, sensitivity, specificity, and area under the curve (AUC). The results showed that CEFS performs comparably to state-of-the-art feature selection methods in the literature. Additionally, the selected features were fed to an ensemble of three classifiers, namely support vector machine (SVM), k-nearest neighbor (k-NN), and decision tree (DT), to evaluate performance in the classification stage. The ensemble uses majority voting, which aggregates the outputs of the individual classifiers to determine the final label. The results demonstrate the potential of CEFS in machine learning applications, particularly in the context of disease diagnosis and prognosis.
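
The two-step cascade and the voting stage described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not say how entropy is estimated, which genes the entropy step retains, or how many genes survive each step, so the equal-width binning, the choice to keep the highest-entropy genes, and the cut-offs `n_entropy` and `n_final` are all assumptions. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_score(values, bins=10):
    """Shannon entropy (bits) of one gene after equal-width binning (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # a constant gene carries no uncertainty
        return 0.0
    width = (hi - lo) / bins
    # The maximum value would fall into bin index `bins`, so clamp it into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one gene."""
    overall_mean = sum(values) / len(values)
    between = within = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        mean = sum(group) / len(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += sum((v - mean) ** 2 for v in group)
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return between / within


def cefs_select(X, y, n_entropy, n_final, bins=10):
    """Cascade: keep n_entropy genes by entropy, then n_final of those by Fisher score.

    X is a list of samples, each a list of gene expression values; y holds class labels.
    Keeping the highest-entropy genes in stage 1 is an assumption of this sketch.
    """
    n_genes = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_genes)]
    by_entropy = sorted(range(n_genes),
                        key=lambda j: entropy_score(columns[j], bins),
                        reverse=True)
    stage1 = by_entropy[:n_entropy]
    by_fisher = sorted(stage1,
                       key=lambda j: fisher_score(columns[j], y),
                       reverse=True)
    return by_fisher[:n_final]


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    With three classifiers and two classes no tie can occur; otherwise the
    label seen first among the tied ones wins.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

In practice the three lists passed to `majority_vote` would be the predictions of the SVM, k-NN, and DT classifiers trained on the columns returned by `cefs_select`, for example from scikit-learn's `SVC`, `KNeighborsClassifier`, and `DecisionTreeClassifier`. The sketch leaves the classifiers out so that it runs with the standard library alone.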
