Random forest versus logistic regression: a large-scale benchmark experiment

BMC Bioinformatics - Volume 19 - Pages 1-14 - 2018

Raphael Couronné¹, Philipp Probst¹, Anne-Laure Boulesteix¹

¹Department of Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany

Abstract

The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has since become a standard classification approach that competes with logistic regression (LR) in many innovation-friendly scientific fields. In this context, we present a large-scale benchmarking experiment based on 243 real datasets, comparing the prediction performance of the original version of RF with default parameters against that of LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias. RF performed better than LR according to the accuracy measure considered in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI = [0.022, 0.038]) for the accuracy, 0.041 (95% CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95% CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, which emphasizes the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, carefully designed and based on a large number of datasets, will be necessary in the future to evaluate further variants, implementations or parameters of random forests that may yield improved accuracy compared to the original version with default values.
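To make the reported summary statistics more concrete, the following minimal sketch shows one way a mean per-dataset difference, a 95% confidence interval and the share of datasets favouring RF could be computed. It assumes a percentile bootstrap over datasets, which the abstract does not specify, so it should not be read as the authors' procedure. The function names `brier_score` and `bootstrap_mean_ci` and the values in `toy_diffs` are made up for illustration and are not taken from the study. The Brier score is the mean squared difference between predicted probabilities and the 0/1 outcomes, so lower is better, which is why a negative RF minus LR difference favours RF.

```python
import random
import statistics


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return statistics.fmean((p - y) ** 2 for y, p in zip(y_true, p_pred))


def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-dataset differences.

    Assumption: datasets are resampled with replacement; this is one
    common choice, not necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    n = len(diffs)
    # Mean of each bootstrap resample, sorted to read off percentiles.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi


# Hypothetical per-dataset accuracy differences (RF minus LR), illustration only.
toy_diffs = [0.05, 0.01, -0.02, 0.08, 0.03, 0.00, 0.04, -0.01, 0.06, 0.02]

mean_diff, ci_low, ci_high = bootstrap_mean_ci(toy_diffs)
share_rf_better = sum(d > 0 for d in toy_diffs) / len(toy_diffs)

print(f"mean difference: {mean_diff:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"share of datasets where RF is better: {share_rf_better:.0%}")
```

In a real benchmark, each entry of `toy_diffs` would be replaced by the difference in cross-validated performance between the two methods on one dataset, and the same function could be applied to differences in AUC or Brier score.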

