Fast greedy $\mathcal{C}$-bound minimization with guarantees

Machine Learning, Volume 109, pages 1945–1986, 2020
Baptiste Bauvin1,2, Cécile Capponi2, Jean-Francis Roy3, François Laviolette2,4
1Aix-Marseille University, University of Toulon, CNRS, LIS (Qarma), Marseille, France
2Department of Computer Science and Software Engineering, Université Laval, Québec, Canada
3Coveo Solutions Inc., Québec, Canada
4Element AI, Montréal, Canada

Abstract

The $\mathcal{C}$-bound is a tight bound on the true risk of a majority vote classifier that relies on the individual quality and pairwise disagreement of the voters, and it comes with PAC-Bayesian generalization guarantees. Based on this bound, MinCq is a classification algorithm that returns a dense distribution on a finite set of voters by minimizing it. Introduced later and inspired by boosting, CqBoost uses a column generation approach to build a sparse, $\mathcal{C}$-bound-optimal distribution on a possibly infinite set of voters. However, both approaches have a high computational learning time because they minimize the $\mathcal{C}$-bound by solving a quadratic program. One advantage of CqBoost, however, is its experimentally observed ability to provide sparse solutions. In this work, we address the problem of accelerating the $\mathcal{C}$-bound minimization process while preserving the sparsity of the solution and without losing accuracy. We present CB-Boost, a computationally efficient classification algorithm relying on a greedy, boosting-based $\mathcal{C}$-bound optimization. An in-depth analysis proves the optimality of the greedy minimization process and quantifies the decrease of the $\mathcal{C}$-bound achieved by the algorithm. Generalization guarantees are then derived from existing PAC-Bayesian theorems. In addition, we experimentally evaluate the relevance of CB-Boost with respect to the three main properties we expect of it: accuracy, sparsity, and computational efficiency compared to MinCq, CqBoost, AdaBoost, and other ensemble methods. As observed in these experiments, CB-Boost not only achieves results comparable to the state of the art, but also provides $\mathcal{C}$-bound sub-optimal weights with very low computational demand while keeping the sparsity property of CqBoost.
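For intuition, recall the form of the empirical $\mathcal{C}$-bound used in this line of work: writing $M_Q(x, y) = y \, \mathbf{E}_{h \sim Q}\, h(x)$ for the margin of the $Q$-weighted majority vote on an example $(x, y)$ with $y \in \{-1, +1\}$, and $\mu_1 = \mathbf{E}\, M_Q$, $\mu_2 = \mathbf{E}\, M_Q^2$ for its first two moments, the bound reads $1 - \mu_1^2 / \mu_2$ whenever $\mu_1 > 0$. The sketch below illustrates, under our own simplifying assumptions, how a greedy minimization of this quantity can proceed: starting from a single voter, each round adds the (voter, weight) pair that most decreases the empirical $\mathcal{C}$-bound. CB-Boost derives the optimal weight of the incoming voter analytically; here that derivation is replaced by a plain one-dimensional numerical search, and the function and variable names (greedy_cbound_boost, c_bound, and so on) are ours, not the authors'.

import numpy as np
from scipy.optimize import minimize_scalar

def c_bound(margins):
    # Empirical C-bound: 1 - mu_1^2 / mu_2, where mu_1 and mu_2 are the first and
    # second moments of the majority-vote margin. The ratio is invariant to a
    # positive rescaling of the weights, so the margins need not be normalized here.
    mu1 = margins.mean()
    mu2 = (margins ** 2).mean()
    if mu1 <= 0 or mu2 == 0:
        return 1.0  # the bound is only informative when the mean margin is positive
    return 1.0 - mu1 ** 2 / mu2

def greedy_cbound_boost(H, y, n_rounds=50):
    # H: (n_samples, n_voters) matrix of voter outputs in {-1, +1}
    # y: (n_samples,) labels in {-1, +1}
    # Returns a sparse, normalized weight vector over the voters.
    voter_margins = H * y[:, None]                  # per-voter margins y * h(x)
    weights = np.zeros(H.shape[1])
    first = np.argmax(voter_margins.mean(axis=0))   # start from the best single voter
    weights[first] = 1.0
    F = voter_margins[:, first].astype(float)       # margin of the current (unnormalized) vote

    for _ in range(n_rounds):
        best_cb, best_j, best_a = c_bound(F), None, None
        for j in range(H.shape[1]):
            # one-dimensional search for the weight of candidate j that most
            # reduces the C-bound of the combined vote F + a * h_j
            res = minimize_scalar(
                lambda a, col=voter_margins[:, j]: c_bound(F + a * col),
                bounds=(0.0, 10.0), method="bounded")
            if res.fun < best_cb - 1e-12:
                best_cb, best_j, best_a = res.fun, j, res.x
        if best_j is None:                          # no voter improves the bound: stop
            break
        weights[best_j] += best_a
        F += best_a * voter_margins[:, best_j]
    return weights / weights.sum()

With, say, decision stumps as the pool of voters in H, only the voters selected by the loop receive non-zero weight, which mirrors the sparsity behaviour the abstract attributes to CqBoost and CB-Boost.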
