GPU-accelerated Gibbs sampling: a case study of the Horseshoe Probit model
Abstract
Gibbs sampling is a widely used Markov chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. Many MCMC implementations do not extend easily to parallel computing environments, because their inherently sequential nature incurs a large synchronization cost. In the case study presented in this paper, we show how to perform Gibbs sampling in a fully data-parallel manner on a graphics processing unit, for a large class of exchangeable models that admit latent variable representations. Our approach takes a systems perspective, with emphasis placed on efficient use of compute hardware. We demonstrate our method on a Horseshoe Probit regression model and find that our implementation scales effectively to thousands of predictors and millions of data points simultaneously.