Control variates for stochastic gradient MCMC

Jack Baker1, Paul Fearnhead1, Emily B. Fox2, Christopher Nemeth1
1STOR-i Centre for Doctoral Training, Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
2Department of Statistics, University of Washington, Seattle, USA

Tóm tắt

It is well known that Markov chain Monte Carlo (MCMC) methods scale poorly with dataset size. A popular class of methods for solving this issue is stochastic gradient MCMC (SGMCMC). These methods use a noisy estimate of the gradient of the log-posterior, which reduces the per iteration computational cost of the algorithm. Despite this, there are a number of results suggesting that stochastic gradient Langevin dynamics (SGLD), probably the most popular of these methods, still has computational cost proportional to the dataset size. We suggest an alternative log-posterior gradient estimate for stochastic gradient MCMC which uses control variates to reduce the variance. We analyse SGLD using this gradient estimate, and show that, under log-concavity assumptions on the target distribution, the computational cost required for a given level of accuracy is independent of the dataset size. Next, we show that a different control-variate technique, known as zero variance control variates, can be applied to SGMCMC algorithms for free. This postprocessing step improves the inference of the algorithm by reducing the variance of the MCMC output. Zero variance control variates rely on the gradient of the log-posterior; we explore how the variance reduction is affected by replacing this with the noisy gradient estimate calculated by SGMCMC.

Từ khóa


Tài liệu tham khảo

Ahn, S., Korattikara, A., Liu, N., Rajan, S., Welling, M.: Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 9–18. ACM, New York (2015)

Baker, J., Fearnhead, P., Fox, E.B., Nemeth, C.: sgmcmc: An R package for stochastic gradient Markov Chain Monte Carlo. J. Stat. Softw. (2017). https://arxiv.org/abs/1710.00578

Bardenet, R., Doucet, A., Holmes, C.: On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18(47), 1–43 (2017)

Bierkens, J., Fearnhead, P., Roberts, G.: The zig-zag process and super-efficient sampling for Bayesian analysis of big data. (2016). https://arxiv.org/abs/1607.03188

Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric. 24(3), 131–151 (1999)

Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of the 19th international conference on computational statistics, pp. 177–187. Springer, Berlin (2010)

Chatterji, N.S., Flammarion, N., Ma, Y.-A., Bartlett, P.L., Jordan, M.I.: On the theory of variance reduction for stochastic gradient Monte Carlo. (2018). https://arxiv.org/abs/1802.05431v1

Chen, C., Wang, W., Zhang, Y., Su, Q., Carin, L.: A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. (2017). https://arxiv.org/abs/1709.01180

Chen, T., Fox, E., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: Proceedings of the 31st international conference on machine learning PMLR, pp. 1683–1691 (2014)

Dalalyan, A.S., Karagulyan, A.G.: User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. (2017). https://arxiv.org/abs/1710.00095

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R.D., Neven, H.: Bayesian sampling using stochastic gradient thermostats. In: Advances in neural information processing systems 27, pp. 3203–3211. Curran Associates, Inc, Red Hook (2014)

Dubey, K.A., Reddi, S.J., Williamson, S.A., Poczos, B., Smola, A.J., Xing, E.P.: Variance reduction in stochastic gradient Langevin dynamics. In: Advances in neural information processing systems, vol. 29, pp. 1154–1162. Curran Associates, Inc, Red Hook (2016)

Durmus, A., Moulines, E.: High-dimensional bayesian inference via the unadjusted langevin algorithm. (2016). https://hal.archives-ouvertes.fr/hal-01304430/

Friel, N., Mira, A., Oates, C.: Exploiting multi-core architectures for reduced-variance estimation with intractable likelihoods. Bayesian Anal. 11(1), 215–245 (2016)

Le Cam, L.: Asymptotic Methods in Statistical Decision Theory. Springer, Berlin (2012)

Mira, A., Solgi, R., Imparato, D.: Zero variance Markov chain Monte Carlo for Bayesian estimators. Stat. Comput. 23(5), 653–662 (2013)

Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in neural information processing systems, vol. 20, pp. 1257–1264. Curran Associates, Inc, Red Hook (2008)

Nagapetyan, T., Duncan, A., Hasenclever, L., Vollmer, S.J., Szpruch, L., Zygalakis, K.: The true cost of stochastic gradient Langevin dynamics. (2017). https://arxiv.org/abs/1706.02692

Neal, R.M.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G.L., Meng, X.-L. (eds.) Handbook of Markov Chain Monte Carlo. Chapman & Hall, Boca Raton (2010)

Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

Papamarkou, T., Mira, A., Girolami, M.: Zero variance differential geometric Markov chain Monte Carlo algorithms. Bayesian Anal. 9(1), 97–128 (2014)

Patterson, S., Teh, Y.W.: Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In: Advances in neural information processing systems, vol. 26, pp. 3102–3110. Curran Associates, Inc, Red Hook (2013)

Pollock, M., Fearnhead, P., Johansen, A.M., Roberts, G.O.: The scalable Langevin exact algorithm: Bayesian inference for big data. (2016). https://arxiv.org/abs/1609.03436

Ripley, B.D.: Stochastic Simulation. Wiley, Hoboken (2009)

Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

Roberts, G.O., Tweedie, R.L.: Exponential convergence of langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)

Teh, Y.W., Thiéry, A.H., Vollmer, S.J.: Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17(7), 1–33 (2016)

Tran, D., Kucukelbir, A., Dieng, A.B., Rudolph, M., Liang, D., Blei, D.M.: Edward: a library for probabilistic modeling, inference, and criticism. (2016). https://arxiv.org/abs/1610.09787

Vollmer, S.J., Zygalakis, K.C.: (Non-) asymptotic properties of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17(159), 1–48 (2016)

Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th international conference on machine learning PMLR, pp. 681–688 (2011)

Wenzhe Li, Sungjin Ahn, M.W.: Scalable MCMC for mixed membership stochastic blockmodels. In: Proceedings of the 19th international conference on artificial intelligence and statistics PMLR, pp. 723–731 (2016)