Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach

BioData Mining - Tập 9 Số 1 - 2016
Ursula Neumann1, Mona Riemenschneider1, Jan-Peter Sowa2, Theodor Baars3, Julia Kälsch2, Ali Canbay2, Dominik Heider4
1Department of Bioinformatics, Straubing, 94315, Germany
2Department of Gastroenterology and Hepatology, University Hospital, University Duisburg-Essen, Essen, 45122, Germany
3Clinic for Cardiology, West German Heart and Vascular Centre Essen, University Hospital, University Duisburg-Essen, Essen, 45122, Germany
4Wissenschaftszentrum Weihenstephan, Technische Universität München, Freising, 85354, Germany

Tóm tắt

Từ khóa


Tài liệu tham khảo

Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Therap. 2001; 69(3):89–95.

Hall M. Correlation-based feature selection for machine learning. 1999. PhD thesis, Department of Computer Science, Waikato University, New Zealand.

Peng H, Long F, Ding C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27(8):1226–38.

Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997; 97:273–324.

Jain A, Zongker D. Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell. 1997; 19(2):153–8.

He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem. 2010; 34:215–25.

Sandri M, Zuccolotto P. A bias correction algorithm for the gini variable importance measure in classification trees. J Comput Graph Stat. 2008; 17(3):611–28.

Boulesteix AL, Janitza S, Kruppa J, KÃűnig IR. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Rev Data Mining Knowl Discov. 2012; 2(6):493–507.

Kuncheva LI. Combining Pattern Classifiers: Methods and Algorithms. Hoboken: John Wiley & Sons; 2004.

Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.

Heider D, Hauke S, Pyka M, Kessler D. Insights into the classification of small gtpases. Adv Appl Bioinform Chem. 2010; 3:15–24.

van den Boom J, Heider D, Martin SR, Pastore A, Mueller JW. 3-phosphoadenosine 5-phosphosulfate (paps) synthases, naturally fragile enzymes specifically stabilized by nucleotide binding. J Biol Chem. 2012; 287(21):17645–55.

Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, van Hijum SA. Data mining in the life sciences with random forest: a walk in the park or lost in the jungle?Brief Bioinform. 2013; 14(3):315–26.

Dybowski JN, Riemenschneider M, Hauke S, Pyka M, Verheyen J, Hoffmann D, Heider D. Improved bevirimat resistance prediction by combination of structural and sequence-based classifiers. BioData Mining. 2011; 4:26.

Riemenschneider M, Senge R, Neumann U, Hüllermeier E, Heider D. Exploiting hiv-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification. BioData Mining. 2016; 9:10.

Hothorn T, Hornik K, Zeileis A. Party: A Laboratory for Recursive Part(y)itioning. http://CRAN.R-project.org/ .

Calle M, Urrea V, Boulesteix LA, Malats N. Auc-rf: A new strategy for genomic profiling with random forest. Hum Heredity. 2011; 72(2):121–32.

Janitza S, Strobl C, Boulesteix AL. An auc-based permutation variable importance measure for random forests. BMC Bioinformatics. 2013; 14:119.

Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res. 2004; 5:1205–24.

Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carr G, Marquz JRG, Gruber B, Lafourcade B, Leito PJ, Mnkemller T, McClean C, Osborne PE, Reineking B, Schrder B, Skidmore AK, Zurell D, Lautenbach S. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013; 36(1):27–46.

Suzuki N, Olson DH, Reilly EC. Developing landscape habitat models for rare amphibians with small geographic ranges: a case study of siskiyou mountains salamanders in the western usa. Biodiversity Conserv. 2008; 17:2197–218.

Elith J, Graham CH, Anderson RP, Dudk M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton JM, Townsend Peterson A, Phillips SJ, Richardson K, Scachetti-Pereira R, Schapire RE, Sobern J, Williams S, Wisz MS, Zimmermann NE. Novel methods improve prediction of species distributions from occurrence data. Ecography. 2006; 29(2):129–51.

Bauer DF. Constructing confidence sets using rank statistics. J Am Stat Assoc. 1972; 67:687–90.

Breiman L. Bagging predictors. Mach Learn. 1996; 24(2):123–40.

Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997; 55(1):119–39.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2011.

Sing T, Sander O, Beerenwinkel N, Lengauer T. Rocr: visualizing classifier performance in r. Bioinformatics. 2005; 21(20):3940–1.

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988; 44(3):837–45.

Baars T, Neumann U, Jinawy M, Hendricks S, Sowa JP, Klsch J, Riemenschneider M, Gerken G, Erbel R, Heider D, Canbay A. In acute myocardial infarction liver parameters are associated with stenosis diameter. Medicine. 2016; 95(6):2807.

Sowa JP, Heider D, Bechmann LP, Gerken G, Hoffmann D, Canbay A. Novel algorithm for non-invasive assessment of fibrosis in nafld. PLOS ONE. 2013; 8(4):62439.

Lichman M. UCI Machine Learning Repository. 2013. http://archive.ics.uci.edu/ml .

Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics. 2007; 8(25):1–21.

Barbosa E, Rttger R, Hauschild AC, Azevedo V, Baumbach J. On the limits of computational functional genomics for bacterial lifestyle prediction. Brief Funct Genomics. 2014; 13:398–408.

Sowa JP, Atmaca, Kahraman A, Schlattjan M, Lindner M, Sydor S, Scherbaum N, Lackner K, Gerken G, Heider D, Arteel GE, Erim Y, Canbay A. Non-invasive separation of alcoholic and non-alcoholic liver disease with predictive modeling. PLOS ONE. 2014; 9(7):101444.

Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inform Syst. 2007; 12:95–116.

Mucciardi AN, Gose EE. A comparison of seven techniques for choosing subsets of pattern recognition properties. IEEE Trans Comput. 1971; 9:1971910231031.

Almuallim H, Dietterich TG. Learning with many irrelevant features. 1991:547–552. Proceedings of the Ninth National Conference on Artificial Intelligence,San Jose, CA: AAAI Press.

Doak J. An evaluation of feature-selection methods and their application to computer security (technical report cse-92-18). Davis: University California, Department of Computer Science. 1992.

Caruana RA, Freitag D. Greedy attribute selection. In: Proceedings of the Eleventh Inter-national Conference on Machine Learning. New Brunswick, NJ: Morgan Kaufmann: 1994. p. 28–36.

Kononenko I. On biases in estimating multi-valued attributes. Montreal; 1995. pp. 1034–1040.

Blum AL, Langleyb P. Selection of relevant features and examples in machine learning. Artif Intell. 1997; 97(1–2):245–71.

Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010; 26(3):392–8.

Piao Y, Piao M, Park K, Ho Ryu K. An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data. Bioinformatics. 2012; 28(24):3306–15.

van Landeghem S, Abeel T, Saeys Y, van de Peer Y. Discriminative and informative features for biomolecular text mining with ensemble feature selection. Bioinformatics. 2010; 26(18):554–60.

Bagley SC, Whiteb H, Golomb BA. Logistic regression in the medical literature:standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol. 2001; 54:979–85.