A hybrid random forest to predict soccer matches in international tournaments

Journal of Quantitative Analysis in Sports - Volume 15, Issue 4 - Pages 271-287 - 2019
Andreas Groll¹, Christophe Ley², Gunther Schauberger³, Hans Van Eetvelde²
¹ TU Dortmund University, Faculty of Statistics, Vogelpothsweg 87, 44227 Dortmund, Germany
² Ghent University, Department of Applied Mathematics, Computer Science and Statistics, Krijgslaan 281, S9, Campus Sterre, Ghent 9000, Belgium
³ Technische Universitaet Muenchen, Department of Sport and Health Sciences, Munich, Bavaria, Germany

Abstract

In this work, we propose a new hybrid modeling approach for the scores of international soccer matches that combines random forests with Poisson ranking methods. While the random forest is based on the competing teams’ covariate information, the ranking methods estimate ability parameters from historical match data that adequately reflect the current strength of the teams. We compare the new hybrid random forest model to its separate building blocks, as well as to conventional Poisson regression models, with regard to their predictive performance on all matches from the four FIFA World Cups 2002–2014. It turns out that adding the team ability parameters from the ranking methods to the random forest as an additional covariate improves the predictive power substantially. Finally, the hybrid random forest is used (in advance of the tournament) to predict the FIFA World Cup 2018. To complete our analysis of the previous World Cup data, the corresponding 64 matches serve as an independent validation data set. On this set we are able to confirm the compelling predictive potential of the hybrid random forest, which clearly outperforms all other methods, including the betting odds.
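As a rough illustration of how score predictions of this kind can be used, the sketch below turns two predicted expected-goal values into win, draw and loss probabilities. It assumes the two goal counts are independent Poisson variables, which is a simplifying assumption and not something the abstract states. The function names, the truncation at 15 goals and the example intensities 1.8 and 0.9 are all hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a: float, lam_b: float, max_goals: int = 15):
    """Return (P(A wins), P(draw), P(B wins)) for expected goals lam_a and lam_b.

    Assumption: goals of team A and team B are independent Poisson counts.
    The score grid is truncated at max_goals and renormalised, so the three
    probabilities sum to one.
    """
    win = draw = loss = 0.0
    for goals_a in range(max_goals + 1):
        p_a = poisson_pmf(goals_a, lam_a)
        for goals_b in range(max_goals + 1):
            p = p_a * poisson_pmf(goals_b, lam_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p
    total = win + draw + loss
    return win / total, draw / total, loss / total


# Hypothetical expected goals, e.g. as predicted by some fitted model.
p_win, p_draw, p_loss = outcome_probabilities(1.8, 0.9)
print(f"win={p_win:.3f} draw={p_draw:.3f} loss={p_loss:.3f}")
```

Under these assumptions, the resulting three-way probabilities are in the same form as probabilities implied by betting odds, which is one way such a model's forecasts could be compared with the bookmakers'.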
