Early Access

Early Access articles are published articles within a journal that have not yet been assigned to a formal issue. This means they do not yet have a volume number, issue number, or page numbers assigned to them; however, they can still be found and cited using their DOI (Digital Object Identifier). Early Access publication benefits the research community by making new scientific discoveries known as quickly as possible.

Readers can access Early Access articles via the “Early Access” tab for the selected journal.

Improving sampling accuracy of stochastic gradient MCMC methods via non-uniform subsampling of gradients

  • *Corresponding author

© 2021 The Author(s). Published by AIMS, LLC. This is an Open Access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Abstract
  • Many Markov Chain Monte Carlo (MCMC) methods leverage gradient information of the potential function of the target distribution to explore the sample space efficiently. However, computing gradients can be computationally expensive in large-scale applications, such as those arising in contemporary machine learning. Stochastic Gradient (SG-)MCMC methods approximate gradients by stochastic ones, commonly computed from uniformly subsampled data points, and achieve improved computational efficiency, albeit at the price of introducing sampling error. We propose a non-uniform subsampling scheme that improves sampling accuracy. The proposed exponentially weighted stochastic gradient (EWSG) method is designed so that a non-uniform-SG-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method, and hence the inaccuracy due to the SG approximation is reduced. EWSG differs from classical variance reduction (VR) techniques in that it targets the entire distribution rather than just the variance; nevertheless, a reduction of its local variance is also proved. EWSG can also be viewed as an extension of the importance sampling idea, successful for stochastic-gradient-based optimization, to sampling tasks. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the MCMC algorithm. Numerical experiments are provided, not only to demonstrate EWSG's effectiveness but also to guide hyperparameter choices, and to validate our non-asymptotic global error bound despite approximations in the implementation. Notably, while statistical accuracy is improved, the convergence speed can be comparable to that of the uniform version, which renders EWSG a practical alternative to VR (though EWSG and VR can also be combined).

    Mathematics Subject Classification: Primary: 65C05, 65C40, 60J20; Secondary: 62D05, 37A50, 37A60, 90C15.
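As a rough illustration of the non-uniform subsampling idea described in the abstract, the sketch below couples a Metropolis-Hastings chain on the data index to an SGLD-style update on a toy Gaussian-mean target. All names, the exponential weight choice, and the importance-weight correction are illustrative assumptions for this toy, not the paper's exact EWSG construction (EWSG designs the weights so that no explicit reweighting is needed; the explicit normalization below is only affordable because the toy data set is tiny).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: posterior of a Gaussian mean, U(x) = sum_i (x - y_i)^2 / 2,
# so pi(x) = N(ybar, 1/n). Everything here is an illustrative assumption.
y = rng.normal(1.0, 1.0, size=50)
n = len(y)
ybar = y.mean()

def grad_Ui(x, i):
    # gradient of the i-th potential term
    return x - y[i]

def index_weights(x, beta=1.0):
    # Illustrative exponential weights favoring data points whose
    # per-datum gradient is large (EWSG's actual weights differ in detail).
    g = np.abs(x - y)
    w = np.exp(beta * (g - g.max()))  # shift the exponent for stability
    return w / w.sum()

def mh_index_step(i, p, n_moves=5):
    # Metropolis-Hastings chain on the data index with stationary law p:
    # propose a uniform index j, accept with probability min(1, p_j / p_i).
    for _ in range(n_moves):
        j = rng.integers(n)
        if rng.random() < p[j] / p[i]:
            i = j
    return i

def sgld_nonuniform(n_steps=30000, h=5e-4):
    x, i = 0.0, 0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        p = index_weights(x)
        i = mh_index_step(i, p)
        # importance-weight correction keeps this toy estimate unbiased:
        # E[grad_Ui / p_i] = sum_i grad_Ui = full-batch gradient
        g_hat = grad_Ui(x, i) / p[i]
        x = x - h * g_hat + np.sqrt(2 * h) * rng.normal()
        samples[t] = x
    return samples

s = sgld_nonuniform()
print(s[5000:].mean())  # chain mean should land near the posterior mean ybar
```

After burn-in, the chain mean should sit close to ybar, with the index chain spending more time on data points that currently contribute large gradients.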


  • Figure 1.  Performance quantification (Gaussian target)

    Figure 2.  BLR learning curve

    Figure 3.  BNN learning curve. Shade: one standard deviation.

    Figure 4.  KL divergence

    Figure 5.  Posterior prediction of mean (left) and standard deviation (right) of log likelihood on test data set generated by SGHMC, EWSG and EWSG-VR on two Bayesian logistic regression tasks. Statistics are computed based on 1000 independent simulations. Minibatch size $ b = 1 $ for all methods except FG. $ M = 1 $ for EWSG and EWSG-VR

    Figure 6.  (a) Histogram of data used in each iteration for FlyMC algorithm. (b) Autocorrelation plot of FlyMC, EWSG and MH. (c) Samples of EWSG. (d) Samples of FlyMC

    Table 1.  Accuracy, log likelihood and wall time of various algorithms on test data after one data pass (mean $ \pm $ std)

    Accuracy(%) 75.283 $ \pm $ 0.016 75.126 $ \pm $ 0.020 75.268 $ \pm $ 0.017 75.306 $ \pm $ 0.016 75.199 $ \pm $ 0.080
    Log Likelihood -0.525 $ \pm $ 0.000 -0.526 $ \pm $ 0.000 -0.525 $ \pm $ 0.000 -0.523 $ \pm $ 0.000 -0.523 $ \pm $ 0.000
    Wall Time (s) 3.085 $ \pm $ 0.283 4.312 $ \pm $ 0.359 3.145 $ \pm $ 0.307 3.755 $ \pm $ 0.387 291.295 $ \pm $ 56.368

    Table 2.  Test error (mean $ \pm $ standard deviation) after 200 epochs

    Method Test Error(%), MLP Test Error(%), CNN
    SGLD 1.976 $ \pm $ 0.055 0.848 $ \pm $ 0.060
    pSGLD 1.821 $ \pm $ 0.061 0.860 $ \pm $ 0.052
    SGHMC 1.833 $ \pm $ 0.073 0.778 $ \pm $ 0.040
    CP-SGHMC 1.835 $ \pm $ 0.047 0.772 $ \pm $ 0.055
    EWSG 1.793 $ \pm $ 0.100 0.753 $ \pm $ 0.035

    Table 3.  Test errors of EWSG (top of each cell) and SGHMC (bottom of each cell) after 200 epochs. $ b $ is the minibatch size for EWSG, and the minibatch size of SGHMC is set to $ b\times(M+1) $ to ensure the same number of data points used per parameter update for both algorithms. The step size is set to $ h = \frac{10}{b(M+1)} $ as suggested in [12], different from that used to produce Table 2. Results with smaller test error are highlighted in boldface

    $ b $ $ M+1=2 $ $ M+1=5 $ $ M+1=10 $
    $ 100 $ EWSG: 1.86% 1.83% 1.80%
    SGHMC: 1.94% 1.92% 1.97%
    $ 200 $ EWSG: 1.90% 1.87% 1.80%
    SGHMC: 1.87% 1.97% 2.07%
    $ 500 $ EWSG: 1.79% 2.01% 2.36%
    SGHMC: 1.97% 2.17% 2.37%
  • [1] S. Ahn, A. Korattikara and M. Welling, Bayesian posterior sampling via stochastic gradient fisher scoring, In 29th International Conference on Machine Learning, ICML 2012, (2012), 1591–1598.
    [2] F. Bach, Stochastic gradient methods for machine learning, Technical report, INRIA - Ecole Normale Superieur, 2013. http://lear.inrialpes.fr/people/harchaoui/projects/gargantua/slides/bach_gargantua_nov2013.pdf.
    [3] J. Baker, P. Fearnhead, E. Fox and C. Nemeth, Control variates for stochastic gradient MCMC, Stat. Comput., 29 (2019), 599-615.  doi: 10.1007/s11222-018-9826-2.
    [4] R. Bardenet, A. Doucet and C. Holmes, On Markov chain Monte Carlo methods for tall data, J. Mach. Learn. Res., 18 (2017), 43 pp.
    [5] V. S. Borkar and S. K. Mitter, A strong approximation theorem for stochastic recursive algorithms, J. Optim. Theory Appl., 100 (1999), 499-513.  doi: 10.1023/A:1022630321574.
    [6] N. Bou-Rabee, A. Eberle and R. Zimmer, Coupling and convergence for Hamiltonian Monte Carlo, Ann. Appl. Probab., 30 (2018), 1209-1250.  doi: 10.1214/19-AAP1528.
    [7] N. Bou-Rabee and H. Owhadi, Long-run accuracy of variational integrators in the stochastic context, SIAM J. Numer. Anal., 48 (2010), 278-297.  doi: 10.1137/090758842.
    [8] N. Bou-Rabee and J. M. Sanz-Serna, Geometric integrators and the Hamiltonian Monte Carlo method, Acta Numer., 27 (2018), 113-206.  doi: 10.1017/S0962492917000101.
    [9] S. Brooks, A. Gelman, G. Jones and X.-L. Meng, Handbook of Markov Chain Monte Carlo, CRC Press, 2011.  doi: 10.1201/b10905.
    [10] N. Chatterji, N. Flammarion, Y. Ma, P. Bartlett and M. Jordan, On the theory of variance reduction for stochastic gradient Monte Carlo, ICML, (2018). 
    [11] C. Chen, N. Ding and L. Carin, On the convergence of stochastic gradient MCMC algorithms with high-order integrators, Advances in Neural Information Processing Systems, (2015), 2278-2286. 
    [12] T. Chen, E. B. Fox and C. Guestrin, Stochastic gradient Hamiltonian Monte Carlo, International Conference on Machine Learning, (2014), 1683-1691. 
    [13] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett and M. I. Jordan, Sharp convergence rates for Langevin dynamics in the nonconvex setting, preprint, arXiv: 1805.01648, 2018.
    [14] X. Cheng, N. S. Chatterji, P. L. Bartlett and M. I. Jordan, Underdamped Langevin MCMC: A non-asymptotic analysis, Proceedings of the 31st Conference On Learning Theory, PMLR, (2018). 
    [15] D. Csiba and P. Richtárik, Importance sampling for minibatches, J. Mach. Learn. Res., 19 (2018), 21 pp.
    [16] A. S. Dalalyan and A. Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, Stochastic Process. Appl., 129 (2019), 5278-5311.  doi: 10.1016/j.spa.2019.02.016.
    [17] A. Defazio, F. Bach and S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, In Advances in Neural Information Processing Systems, (2014), 1646–1654.
    [18] A. Defazio and L. Bottou, On the ineffectiveness of variance reduced optimization for deep learning, In Advances in Neural Information Processing Systems, (2019), 1755–1765.
    [19] K. A. Dubey, S. J. Reddi, S. A. Williamson, B. Poczos, A. J. Smola and E. P. Xing, Variance reduction in stochastic gradient Langevin dynamics, In Advances in Neural Information Processing Systems, (2016), 1154–1162.
    [20] T. Fu and Z. Zhang, Cpsg-mcmc: Clustering-based preprocessing method for stochastic gradient mcmc, In Artificial Intelligence and Statistics, (2017), 841–850.
    [21] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato and Y. LeCun, What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, (2009), 2146–2153. doi: 10.1109/ICCV.2009.5459469.
    [22] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, (2013), 315-323. 
    [23] A. Korattikara, Y. Chen and M. Welling, Austerity in mcmc land: Cutting the metropolis-hastings budget, In International Conference on Machine Learning, (2014), 181–189.
    [24] R. Kubo, The fluctuation-dissipation theorem, Reports on Progress in Physics, 29 (1966).  doi: 10.1088/0034-4885/29/1/306.
    [25] C. Li, C. Chen, D. Carlson and L. Carin, Preconditioned stochastic gradient langevin dynamics for deep neural networks, In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
    [26] Q. Li, C. Tai and E. Weinan, Stochastic modified equations and adaptive stochastic gradient algorithms, In J. Mach. Learn. Res., 20 (2019), 47 pp.
    [27] R. Li, H. Zha and M. Tao, Mean-square analysis with an application to optimal dimension dependence of Langevin Monte Carlo, preprint, arXiv: 2109.03839, 2021.
    [28] M. Lichman et al, UCI Machine Learning Repository, 2013.
    [29] Y.-A. Ma, T. Chen and E. B. Fox, A complete recipe for stochastic gradient mcmc, In Advances in Neural Information Processing Systems, (2015), 2917–2925.
    [30] D. Maclaurin and R. P. Adams, Firefly monte carlo: Exact mcmc with subsets of data, In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
    [31] S. Mandt, M. D. Hoffman and D. M. Blei, Stochastic gradient descent as approximate Bayesian inference, J. Mach. Learn. Res., 18 (2017), 134, 35 pp.
    [32] J. C. Mattingly, A. M. Stuart and M. V. Tretyakov, Convergence of numerical time-averaging and stationary measures via Poisson equations, SIAM J. Numer. Anal., 48 (2010), 552-577.  doi: 10.1137/090770527.
    [33] D. Needell, R. Ward and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, Math. Program., 155 (2016), 549-573.  doi: 10.1007/s10107-015-0864-7.
    [34] S. Patterson and Y. W. Teh, Stochastic gradient Riemannian Langevin dynamics on the probability simplex, Advances in Neural Information Processing Systems, (2013), 3102-3110. 
    [35] G. A. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, Texts in Applied Mathematics, 60. Springer, New York, 2014. doi: 10.1007/978-1-4939-1323-7.
    [36] G. O Roberts and R. L. Tweedie, et al., Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli, 2 (1996), 341-363.  doi: 10.2307/3318418.
    [37] M. Schmidt, R. Babanezhad, M. Ahmed, A. Defazio, A. Clifton and A. Sarkar, Non-uniform stochastic average gradient method for training conditional random fields, Artificial Intelligence and Statistics, (2015), 819-828. 
    [38] M. Schmidt, N. L. Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., 162 (2017), 83-112.  doi: 10.1007/s10107-016-1030-6.
    [39] M. Tao and T. Ohsawa, Variational optimization on Lie groups, with examples of leading (generalized) eigenvalue problems, AISTATS, (2020). 
    [40] Y. W. Teh, A. H. Thiery and S. J. Vollmer, Consistency and fluctuations for stochastic gradient Langevin dynamics, J. Mach. Learn. Res., 17 (2016), 7, 33 pp.
    [41] S. J. Vollmer, K. C. Zygalakis and Y. W. Teh, Exploration of the (non-) asymptotic bias and variance of stochastic gradient langevin dynamics, J. Mach. Learn. Res., 17 (2016), 159, 45 pp.
    [42] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, International Conference on Machine Learning, (2011), 681-688. 
    [43] A. G. Wilson, The case for Bayesian deep learning, preprint, arXiv: 2001.10995, 2020.
    [44] R. Zhang, A. F. Cooper and C. D. Sa, Asymptotically optimal exact minibatch Metropolis-Hastings, NeurIPS, (2020). 
    [45] R. Zhang and C. D. Sa, Poisson-minibatching for Gibbs sampling with convergence rate guarantees, NeurIPS, (2019). 
    [46] P. Zhao and T. Zhang, Stochastic optimization with importance sampling for regularized loss minimization, International Conference on Machine Learning, (2015), 1-9. 
    [47] R. Zhu, Gradient-based sampling: An adaptive importance sampling for least-squares, Advances in Neural Information Processing Systems, (2016), 406-414. 