# American Institute of Mathematical Sciences

doi: 10.3934/dcdss.2021157
Online First


## Improving sampling accuracy of stochastic gradient MCMC methods via non-uniform subsampling of gradients

1. Georgia Institute of Technology, USA
2. Google Research, USA
3. School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China

*Corresponding author

Received April 2021; Revised September 2021; Early access December 2021

Many Markov chain Monte Carlo (MCMC) methods leverage gradient information of the potential function of the target distribution to explore the sample space efficiently. However, computing gradients can be expensive in large-scale applications, such as those in contemporary machine learning. Stochastic gradient (SG-)MCMC methods replace exact gradients with stochastic estimates, commonly built from uniformly subsampled data points, and thereby achieve improved computational efficiency, albeit at the price of introducing sampling error. We propose a non-uniform subsampling scheme that improves sampling accuracy. The proposed exponentially weighted stochastic gradient (EWSG) method is designed so that a non-uniform-SG-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method, which reduces the inaccuracy due to the SG approximation. EWSG differs from classical variance reduction (VR) techniques in that it targets the entire distribution rather than just the variance; nevertheless, a reduction of the local variance is also proved. EWSG can also be viewed as an extension of the importance sampling idea, successful for stochastic-gradient-based optimization, to sampling tasks. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the MCMC algorithm. Numerical experiments are provided not only to demonstrate EWSG's effectiveness, but also to guide hyperparameter choices and to validate our non-asymptotic global error bound despite the approximations made in the implementation. Notably, while statistical accuracy is improved, the convergence speed can be comparable to that of the uniform version, which makes EWSG a practical alternative to VR (though EWSG and VR can also be combined).
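The coupled index chain described in the abstract can be sketched as follows. This is a toy illustration only, not the authors' algorithm: `log_weight` is a simplified, hypothetical surrogate for the exponential index weights (the true EWSG weights are state-dependent and derived in the paper), and an overdamped Langevin (SGLD-style) update with minibatch size 1 stands in for the general SG-MCMC method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: sample the posterior of a Gaussian mean (unit variance, flat
# prior), so the potential is U(theta) = sum_i (theta - x_i)^2 / 2.
data = rng.normal(loc=2.0, scale=1.0, size=200)
n = len(data)

def grad_i(theta, i):
    """Per-datum gradient of the potential."""
    return theta - data[i]

def log_weight(i):
    # Hypothetical surrogate weight: favour indices whose per-datum gradient
    # is close to the average gradient.  For this Gaussian toy model the state
    # theta cancels, so the weight happens to be state-independent; the actual
    # EWSG weights depend on the sampler's current state.
    return -0.5 * (data[i] - data.mean()) ** 2

def ewsg_sgld(steps=2000, h=1e-4, mh_moves=1):
    theta, i = 0.0, rng.integers(n)
    out = []
    for _ in range(steps):
        # Metropolis-Hastings chain on the data index, run alongside the sampler.
        for _ in range(mh_moves):
            j = rng.integers(n)  # uniform proposal over indices
            if rng.random() < np.exp(min(0.0, log_weight(j) - log_weight(i))):
                i = j
        # Overdamped Langevin step with the n-times-rescaled single-datum
        # gradient; EWSG tunes the index distribution itself (rather than
        # applying a 1/p importance reweighting) to mimic the full-gradient chain.
        theta += -h * n * grad_i(theta, i) + np.sqrt(2 * h) * rng.normal()
        out.append(theta)
    return np.array(out)

samples = ewsg_sgld()
```

After burn-in, the chain concentrates near the data average, as the posterior of the mean requires; only the index selection, not the Langevin update, changes relative to uniform SGLD.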

Citation: Ruilin Li, Xin Wang, Hongyuan Zha, Molei Tao. Improving sampling accuracy of stochastic gradient MCMC methods via non-uniform subsampling of gradients. Discrete & Continuous Dynamical Systems - S, doi: 10.3934/dcdss.2021157
##### Figures:

- Performance quantification (Gaussian target).
- BLR learning curve.
- BNN learning curve (shade: one standard deviation).
- KL divergence.
- Posterior prediction of the mean (left) and standard deviation (right) of the log-likelihood on the test data set, generated by SGHMC, EWSG, and EWSG-VR on two Bayesian logistic regression tasks. Statistics are computed from 1000 independent simulations. Minibatch size $b = 1$ for all methods except FG; $M = 1$ for EWSG and EWSG-VR.
- (a) Histogram of the data used per iteration by the FlyMC algorithm. (b) Autocorrelation plot of FlyMC, EWSG, and MH. (c) Samples of EWSG. (d) Samples of FlyMC.
Accuracy, log likelihood, and wall time of various algorithms on test data after one data pass (mean $\pm$ std):

| Method | SGLD | pSGLD | SGHMC | EWSG | FlyMC |
|---|---|---|---|---|---|
| Accuracy (%) | 75.283 $\pm$ 0.016 | 75.126 $\pm$ 0.020 | 75.268 $\pm$ 0.017 | 75.306 $\pm$ 0.016 | 75.199 $\pm$ 0.080 |
| Log Likelihood | -0.525 $\pm$ 0.000 | -0.526 $\pm$ 0.000 | -0.525 $\pm$ 0.000 | -0.523 $\pm$ 0.000 | -0.523 $\pm$ 0.000 |
| Wall Time (s) | 3.085 $\pm$ 0.283 | 4.312 $\pm$ 0.359 | 3.145 $\pm$ 0.307 | 3.755 $\pm$ 0.387 | 291.295 $\pm$ 56.368 |
Test error (mean $\pm$ standard deviation) after 200 epochs:

| Method | Test Error (%), MLP | Test Error (%), CNN |
|---|---|---|
| SGLD | 1.976 $\pm$ 0.055 | 0.848 $\pm$ 0.060 |
| pSGLD | 1.821 $\pm$ 0.061 | 0.860 $\pm$ 0.052 |
| SGHMC | 1.833 $\pm$ 0.073 | 0.778 $\pm$ 0.040 |
| CP-SGHMC | 1.835 $\pm$ 0.047 | 0.772 $\pm$ 0.055 |
| EWSG | 1.793 $\pm$ 0.100 | 0.753 $\pm$ 0.035 |
Test errors of EWSG (first value in each cell) and SGHMC (second value) after 200 epochs. Here $b$ is the minibatch size of EWSG; the minibatch size of SGHMC is set to $b\times(M+1)$ so that both algorithms use the same number of data points per parameter update. The step size is $h = \frac{10}{b(M+1)}$ as suggested in [12], different from that used to produce Table 2. The smaller test error in each cell is highlighted in boldface.

| $b$ | $M+1=2$ | $M+1=5$ | $M+1=10$ |
|---|---|---|---|
| $100$ | **1.86%** / 1.94% | **1.83%** / 1.92% | **1.80%** / 1.97% |
| $200$ | 1.90% / **1.87%** | **1.87%** / 1.97% | **1.80%** / 2.07% |
| $500$ | **1.79%** / 1.97% | **2.01%** / 2.17% | **2.36%** / 2.37% |

2020 Impact Factor: 2.425