An experimental study of the intrinsic stability of random forest variable importance measures

Huazhen Wang1,2, Fan Yang3, Zhiyuan Luo1
1Computer Learning Research Centre, Royal Holloway, University of London, Surrey, UK
2College of Computer Science and Technology, HuaQiao University, Xiamen, China
3Automation Department, Xiamen University, Xiamen, China

Abstract

Background

The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention paid to traditional stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, we introduce a new concept, the intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, using Spearman and Pearson tests, we comprehensively investigate how different factors influence the intrinsic stability.
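To make the definition concrete, the following minimal sketch scores the self-consistency of feature rankings across repeated runs as the mean pairwise Spearman rank correlation. This choice of score is an assumption made for illustration; the abstract does not state which stability metric the study uses. The function names and the toy importance values are hypothetical. In practice each importance vector would come from refitting the same forest on the same data with the same parameters, changing only the random seed.

```python
from itertools import combinations


def ranks(values):
    """Return average ranks (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def intrinsic_stability(importance_runs):
    """Mean pairwise Spearman correlation between runs (assumed metric)."""
    pairs = list(combinations(importance_runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Toy importance scores for 4 features in 3 repeated runs.
# These numbers are illustrative only and are not taken from the paper.
runs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.33, 0.12, 0.17],
    [0.41, 0.29, 0.19, 0.11],
]
score = intrinsic_stability(runs)
print(round(score, 3))  # 0.867
```

A score of 1 means every run produced the same ranking; lower values indicate that the ranking changes between runs even though the data and parameters are fixed.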

Results

The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional, small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both a negative monotonic correlation and a negative linear correlation with the intrinsic stability, while OOB accuracy has a monotonic correlation with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.
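The distinction between a monotonic correlation (Spearman) and a linear correlation (Pearson) can be illustrated with a short sketch. The ratio values and the decreasing function below are synthetic and hypothetical, chosen only to show how the two coefficients differ; they are not results from the study.

```python
def pearson(x, y):
    """Pearson correlation coefficient: measures linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def simple_ranks(values):
    """Ranks starting at 1; this simplified version assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: measures monotonic association via ranks."""
    return pearson(simple_ranks(x), simple_ranks(y))


# Hypothetical #feature/#sample ratios for five imaginary datasets.
ratios = [0.5, 1, 5, 50, 500]
# Synthetic stability values: strictly decreasing in the ratio, but not linear.
stability = [1 / (1 + r) for r in ratios]

rho = spearman(ratios, stability)  # -1 up to floating-point rounding
r = pearson(ratios, stability)     # negative, but strictly between -1 and 0
print(rho, r)
```

Because the synthetic relationship is strictly decreasing, the Spearman coefficient reaches its minimum, while the Pearson coefficient is weaker because the relationship is not a straight line. Reporting both, as the study does, separates "stability falls as the ratio grows" from "stability falls in proportion to the ratio".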

Conclusion

First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations, but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability and may help reduce the instability of VIMs. Second, the investigation of the potential factors of intrinsic stability makes users more aware of the risks, and hence more careful, when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
