A feature selection method based on multiple kernel learning with expression profiles of different types
Abstract
With the development of high-throughput technology, researchers can acquire large amounts of expression data of different types from several public databases. Because most of these datasets have a small number of samples and hundreds or thousands of features, extracting informative features effectively and robustly with feature selection techniques is both challenging and crucial. Many feature selection approaches have been proposed and applied to analyse expression data of different types. However, most of these methods are evaluated on only a single type of expression data, using classification accuracy or error rate. In this article, we propose a hybrid feature selection method based on Multiple Kernel Learning (MKL) and evaluate its performance on expression datasets of different types. First, the relevance of each feature to the sample classes is measured using the optimization objective of MKL. In this step, an iterative gradient descent process optimizes both the parameters of the Support Vector Machine (SVM) and the kernel confidence. Then, a set of relevant features is selected by ranking the features according to their objective values. Furthermore, we apply an embedded forward selection scheme to detect compact feature subsets within the relevant feature set. We compare the proposed method with other methods not only in classification accuracy but also in stability, similarity and consistency. Across these performance measures, the proposed method shows a satisfactory capability of feature selection on expression datasets of different types.
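The two-stage idea described in the abstract (rank features by a kernel-based relevance score, then grow a compact subset by forward selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the relevance score here is the kernel-target alignment of a one-feature linear kernel, used as a simple stand-in for the per-feature MKL objective, and the evaluator is a leave-one-out nearest-centroid classifier rather than an SVM. All function names and the toy data are hypothetical.

```python
from math import fsum


def alignment(x, y):
    """Kernel-target alignment of the one-feature linear kernel K = x x^T
    with the label kernel y y^T (labels in {-1, +1}).
    Stand-in for the per-feature MKL objective; an assumption of this sketch."""
    mean = fsum(x) / len(x)
    xc = [v - mean for v in x]              # centre the feature
    xx = fsum(v * v for v in xc)
    if xx == 0.0:
        return 0.0                          # constant feature carries no signal
    xy = fsum(a * b for a, b in zip(xc, y))
    return xy * xy / (xx * len(y))


def rank_features(X, y, top_k):
    """Return the indices of the top_k features, highest alignment first."""
    n_features = len(X[0])
    scores = [alignment([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k]


def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on `subset`.
    Assumes every class has at least two samples."""
    correct = 0
    for i in range(len(X)):
        point = [X[i][j] for j in subset]
        distances = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [fsum(r[j] for r in rows) / len(rows) for j in subset]
            distances[label] = fsum((p - q) ** 2 for p, q in zip(point, centroid))
        predicted = min(distances, key=distances.get)
        correct += predicted == y[i]
    return correct / len(X)


def forward_select(X, y, candidates):
    """Greedily keep each candidate (in rank order) only if it improves
    the leave-one-out accuracy of the current subset."""
    selected, best = [], 0.0
    for j in candidates:
        acc = loo_accuracy(X, y, selected + [j])
        if acc > best:
            selected, best = selected + [j], acc
    return selected, best


# Toy expression matrix: 6 samples x 4 features (hypothetical values).
X = [
    [1.0, 0.3, 0.9, 5.0],
    [1.2, 0.8, 0.2, 5.0],
    [0.9, 0.1, 0.8, 5.0],
    [3.1, 0.4, 1.1, 5.0],
    [2.9, 0.7, 1.9, 5.0],
    [3.0, 0.2, 1.2, 5.0],
]
y = [-1, -1, -1, 1, 1, 1]

relevant = rank_features(X, y, top_k=3)
subset, accuracy = forward_select(X, y, relevant)
print(relevant, subset, accuracy)
```

On this toy data, feature 0 separates the two classes cleanly, so it ranks first and is the only feature the greedy pass keeps. Ranking first and searching only among the top-ranked features keeps the forward pass cheap when there are thousands of features and few samples.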