Can’t see the forest for the trees

Gero Szepannek¹, Björn-Hergen von Holt²
¹ Stralsund University of Applied Sciences, Stralsund, Germany
² Institute of Medical Biometry and Statistics, University of Lübeck, Lübeck, Germany

Abstract

Random forests are currently one of the most popular algorithms for supervised machine learning tasks. Because a forest combines many trees instead of a single one, the resulting model is no longer easy to understand and is often described as a black box. This paper is dedicated to the interpretability of random forest models using tree-based explanations. Two different concepts, namely most representative trees and surrogate trees, are analyzed regarding both their ability to explain the model and their understandability for humans. For this purpose, explanation trees are further extended to groves, i.e. small forests of few trees. The results of an application to three real-world data sets underline the inherent trade-off between both requirements. Using groves makes it possible to control the complexity of an explanation while simultaneously analyzing its explanatory power.
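
The surrogate-tree idea and the trade-off between explanatory power and complexity can be illustrated with a small sketch. It is not the authors' implementation: the data are synthetic, a bagged ensemble of hand-written regression trees with one randomly chosen feature per split stands in for a random forest, and the names `fit_tree`, `fit_forest`, `fidelity` and `n_leaves` are hypothetical helpers introduced here. Fidelity is assumed to be the R² of the surrogate's predictions measured against the forest's predictions, and the number of leaves serves as a rough proxy for complexity.

```python
import random

def fit_tree(X, y, depth, rng=None, mtry=None):
    """Greedy regression tree. A leaf is a float; a split is (feature, threshold, left, right)."""
    mean = sum(y) / len(y)
    if depth == 0 or len(set(y)) == 1:
        return mean
    features = list(range(len(X[0])))
    if mtry is not None:
        features = rng.sample(features, mtry)  # random feature subset per split
    best = None
    for j in features:
        # Candidate thresholds: every distinct value except the smallest,
        # so both children are guaranteed to be non-empty.
        for t in sorted({row[j] for row in X})[1:]:
            left = [i for i, row in enumerate(X) if row[j] < t]
            right = [i for i, row in enumerate(X) if row[j] >= t]
            sse = 0.0
            for idx in (left, right):
                m = sum(y[i] for i in idx) / len(idx)
                sse += sum((y[i] - m) ** 2 for i in idx)
            if best is None or sse < best[0]:
                best = (sse, j, t, left, right)
    if best is None:  # no feature offers a split
        return mean
    _, j, t, left, right = best
    kids = [fit_tree([X[i] for i in idx], [y[i] for i in idx], depth - 1, rng, mtry)
            for idx in (left, right)]
    return (j, t, kids[0], kids[1])

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] < t else right
    return tree

def n_leaves(tree):
    """Number of leaves, used here as a simple complexity measure."""
    return 1 if not isinstance(tree, tuple) else n_leaves(tree[2]) + n_leaves(tree[3])

def fit_forest(X, y, n_trees, depth, rng):
    """Simplified stand-in for a random forest: bootstrap samples, one random feature per split."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        forest.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], depth, rng, mtry=1))
    return forest

def forest_predict(forest, row):
    return sum(predict(tree, row) for tree in forest) / len(forest)

def fidelity(tree, X, target):
    """R^2 of the surrogate's predictions against the forest's predictions (assumed measure)."""
    mean = sum(target) / len(target)
    sse = sum((predict(tree, row) - f) ** 2 for row, f in zip(X, target))
    sst = sum((f - mean) ** 2 for f in target)
    return 1 - sse / sst

rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(120)]  # synthetic data
y = [x1 * x2 + rng.gauss(0, 0.1) for x1, x2 in X]

forest = fit_forest(X, y, n_trees=15, depth=4, rng=rng)
forest_pred = [forest_predict(forest, row) for row in X]

# Fit surrogate trees of growing depth to the forest's predictions, not to y.
results = []
for depth in range(1, 5):
    surrogate = fit_tree(X, forest_pred, depth)
    results.append((depth, n_leaves(surrogate), fidelity(surrogate, X, forest_pred)))

for depth, leaves, fid in results:
    print(f"depth={depth} leaves={leaves} fidelity={fid:.3f}")
```

Because each deeper surrogate refines the shallower one, its fidelity on the same data cannot decrease, while the number of leaves a reader has to follow grows. This is the tension between explanatory power and understandability that the abstract refers to.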
