MCMC algorithms such as Metropolis--Hastings are slowed down by the computation of complex target distributions, as exemplified by huge datasets. We offer a useful generalisation of the Delayed Acceptance approach, devised to reduce such computational costs by a simple and universal divide-and-conquer strategy. The generic acceleration stems from breaking the acceptance step into several parts, aiming at a major gain in computing time that outranks the corresponding reduction in acceptance probability. Each component is sequentially compared with a uniform variate, the first rejection terminating the iteration. We develop theoretical bounds for the variance of the associated estimators relative to the standard Metropolis--Hastings algorithm and produce results on optimal scaling and general optimisation of the procedure.
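To make the mechanism concrete, here is a minimal Python sketch of one delayed-acceptance step for a symmetric random-walk proposal; the function names and interface are ours, not the paper's. The log-target is split into factors whose values sum to the log-density, each factor is tested against its own uniform variate, and the first rejection stops the iteration before the remaining, typically costlier, factors are evaluated.

```python
import numpy as np

def da_mh_step(x, log_factors, propose, rng):
    """One delayed-acceptance Metropolis-Hastings step (symmetric proposal).

    `log_factors` is a sequence of functions whose values sum to the
    log-target density; each factor contributes min(1, rho_i) to the
    overall acceptance probability and is tested against its own
    uniform variate.
    """
    y = propose(x, rng)
    for log_f in log_factors:
        # reject as soon as one partial test fails: the remaining
        # (possibly expensive) factors are never evaluated
        if np.log(rng.uniform()) >= log_f(y) - log_f(x):
            return x
    return y
```

Since rejections occur before the later factors are touched, ordering the factors from cheapest to most expensive maximises the expected saving in computing time.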
Figure 1. Fit of a two-step Metropolis-Hastings algorithm applied to a normal-normal posterior distribution $ \mu|x\sim \mathcal{N}(x/\{1+\sigma_\mu^{-2}\}, 1/\{1+\sigma_\mu^{-2}\}) $ when $ x = 3 $ and $ \sigma_\mu = 10 $, based on $ T = 10^5 $ iterations, with a first acceptance step considering the likelihood ratio and a second acceptance step considering the prior ratio, resulting in an overall acceptance rate of 12%
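The experiment of Figure 1 can be sketched in a few lines of Python; the random-walk proposal scale below is our assumption, not a value from the paper, so the resulting acceptance rate need not match the reported 12%.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs, sigma_mu, T = 3.0, 10.0, 10**5

# factorised log-target: likelihood N(x | mu, 1) and prior N(mu | 0, sigma_mu^2)
log_lik = lambda mu: -0.5 * (x_obs - mu) ** 2
log_prior = lambda mu: -0.5 * (mu / sigma_mu) ** 2

chain, mu, accepted = np.empty(T), 0.0, 0
for t in range(T):
    prop = mu + rng.normal(scale=2.0)  # proposal scale: our assumption
    # first acceptance step: likelihood ratio only
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(mu):
        # second acceptance step, reached only if the first one passes
        if np.log(rng.uniform()) < log_prior(prop) - log_prior(mu):
            mu, accepted = prop, accepted + 1
    chain[t] = mu
print("overall acceptance rate:", accepted / T)
```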
Figure 2. (left) Fit of a multiple-step Metropolis-Hastings algorithm applied to a Beta-binomial posterior distribution $ p|x\sim Be(x+a, N+b-x) $ when $ N = 100 $, $ x = 32 $, $ a = 7.5 $ and $ b = 0.5 $. The binomial $ \mathcal{B}(N, p) $ likelihood is replaced with a product of $ 100 $ Bernoulli terms and an acceptance step is considered for the ratio of each term. The histogram is based on $ 10^5 $ iterations, with an overall acceptance rate of 9%; (centre) raw sequence of successive values of $ p $ in the Markov chain simulated in the above experiment; (right) autocorrelogram of the above sequence
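A sketch of the Figure 2 experiment under the same conventions follows; the proposal scale is again our choice, and we include the prior ratio as an extra acceptance factor, which the caption leaves implicit but which is needed for the chain to target the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
N, x, a, b, T = 100, 32, 7.5, 0.5, 10**5

# one Bernoulli outcome per trial: x successes and N - x failures
z = np.concatenate([np.ones(x), np.zeros(N - x)])
log_term = lambda p, zi: np.log(p) if zi else np.log1p(-p)
log_prior = lambda p: (a - 1) * np.log(p) + (b - 1) * np.log1p(-p)  # Be(a, b)

chain, p = np.empty(T), 0.5
for t in range(T):
    prop = p + rng.normal(scale=0.1)  # proposal scale: our assumption
    ok = 0.0 < prop < 1.0 and np.log(rng.uniform()) < log_prior(prop) - log_prior(p)
    if ok:
        for zi in z:  # one acceptance test per Bernoulli factor
            if np.log(rng.uniform()) >= log_term(prop, zi) - log_term(p, zi):
                ok = False  # first rejection stops the loop
                break
    if ok:
        p = prop
    chain[t] = p
```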
Figure 3. Two top panels: behaviour of $\ell^*(\delta)$ and $\alpha^*(\delta)$ as the relative cost varies. Note that for $\delta \gg 1$ the optimal values converge towards the values computed for the standard Metropolis--Hastings (dashed in red). Two bottom panels: close-up of the region of interest, $0 < \delta < 1$.
Figure 4. Optimal acceptance rate for the DA-MALA algorithm as a function of $\delta$. The optimal acceptance rate for MALA obtained by [27] (in red) is attained at $\delta = 1$.
Table 1. Comparison between MH and MH with Delayed Acceptance on a logistic model, with ESS the effective sample size, ESJD the expected square jumping distance and time the computation time
| Algorithm | rel. ESS (av.) | rel. ESJD (av.) | rel. Time (av.) | rel. gain (ESS) (av.) | rel. gain (ESJD) (av.) |
| --- | --- | --- | --- | --- | --- |
| DA-MH over MH | 1.1066 | 12.962 | 0.098 | 5.47 | 56.18 |
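Since the gains in Table 1 are averages over replications of ratio-based quantities, the "rel. gain" columns need not equal the ratio of the other averaged columns (an average of ratios is not a ratio of averages). A minimal sketch of the per-replication computations, with the function names and the exact definition of the gain being our assumptions:

```python
import numpy as np

def esjd(chain):
    """Expected squared jumping distance: mean squared move between successive states."""
    chain = np.asarray(chain, dtype=float)
    if chain.ndim == 1:
        chain = chain[:, None]
    return float(np.mean(np.sum(np.diff(chain, axis=0) ** 2, axis=1)))

def relative_gain(perf_da, time_da, perf_mh, time_mh):
    """Time-normalised gain of DA-MH over MH for one replication,
    where perf is, e.g., the ESS or ESJD of the corresponding chain."""
    return (perf_da / time_da) / (perf_mh / time_mh)
```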
Table 2. Comparison between standard geometric MALA and geometric MALA with Delayed Acceptance, with ESS the effective sample size, ESJD the expected square jumping distance, time the computation time and $a$ the observed acceptance rate
| Algorithm | ESS (av.) | (sd) | ESJD (av.) | (sd) | time (av.) | (sd) | $a$ (av.) | ESS/time (av.) | ESJD/time (av.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MALA | 7504.48 | 107.21 | 5244.94 | 983.47 | 176078 | 1562.3 | 0.661 | 0.04 | 0.03 |
| DA-MALA | 6081.02 | 121.42 | 5373.253 | 2148.76 | 17342.91 | 6688.3 | 0.09 | 0.35 | 0.31 |
Table 3. Comparison using different performance indicators in the example of mixture estimation, based on 100 replicates of the experiment according to model (9) with a sample size
| Algorithm | ESS (av.) | (sd) | ESJD (av.) | (sd) | time (av.) | (sd) |
| --- | --- | --- | --- | --- | --- | --- |
| MH | 1575.96 | 245.96 | 0.226 | 0.44 | 513.95 | 57.81 |
| MH + DA | 628.77 | 87.86 | 0.215 | 0.45 | 42.22 | 22.95 |
[1] C. Andrieu, A. Lee and M. Vihola, Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers, Bernoulli, 24 (2018), 842-872. doi: 10.3150/15-BEJ785.
[2] E. Angelino, E. Kohler, A. Waterland, M. Seltzer and R. Adams, Accelerating MCMC via parallel predictive prefetching, UAI'14: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, (2014), 22-31.
[3] R. Bardenet, A. Doucet and C. Holmes, On Markov chain Monte Carlo methods for tall data, Journal of Machine Learning Research, 18 (2017), Paper No. 47, 43 pp.
[4] A. Brockwell, Parallel Markov chain Monte Carlo simulation by pre-fetching, Journal of Computational and Graphical Statistics, 15 (2006), 246-261. doi: 10.1198/106186006X100579.
[5] J. Christen and C. Fox, Markov chain Monte Carlo using an approximation, Journal of Computational and Graphical Statistics, 14 (2005), 795-810. doi: 10.1198/106186005X76983.
[6] O. F. Christensen, G. O. Roberts and J. S. Rosenthal, Scaling limits for the transient phase of local Metropolis-Hastings algorithms, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2005), 253-268. doi: 10.1111/j.1467-9868.2005.00500.x.
[7] R. Cornish, P. Vanetti, A. Bouchard-Côté, G. Deligiannidis and A. Doucet, Scalable Metropolis-Hastings for exact Bayesian inference with large datasets, arXiv preprint, arXiv:1901.09881.
[8] L. Devroye, Nonuniform Random Variate Generation, Springer-Verlag, New York, 1986. doi: 10.1007/978-1-4613-8643-8.
[9] J. Diebolt and C. P. Robert, Estimation of finite mixture distributions by Bayesian sampling, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 56 (1994), 363-375. doi: 10.1111/j.2517-6161.1994.tb01985.x.
[10] C. Fox and G. Nicholls, Sampling conductivity images via MCMC, The Art and Science of Bayesian Image Analysis, (1997), 91-100.
[11] A. Gelfand and S. Sahu, On Markov chain Monte Carlo acceleration, Journal of Computational and Graphical Statistics, 3 (1994), 261-276. doi: 10.2307/1390911.
[12] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2011), 123-214. doi: 10.1111/j.1467-9868.2010.00765.x.
[13] A. Golightly, D. Henderson and C. Sherlock, Delayed acceptance particle MCMC for exact inference in stochastic kinetic models, Statistics and Computing, 25 (2015), 1039-1055. doi: 10.1007/s11222-014-9469-x.
[14] H. Jeffreys, Theory of Probability, 1st ed., The Clarendon Press, Oxford, 1939.
[15] A. Korattikara, Y. Chen and M. Welling, Austerity in MCMC land: Cutting the Metropolis-Hastings budget, ICML 2014: International Conference on Machine Learning, (2014), 181-189.
[16] G. McLachlan and D. Peel, Finite Mixture Models, John Wiley, New York, 2000. doi: 10.1002/0471721182.
[17] K. L. Mengersen and R. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Annals of Statistics, 24 (1996), 101-121. doi: 10.1214/aos/1033066201.
[18] P. Neal and G. O. Roberts, Optimal scaling of random walk Metropolis algorithms with non-Gaussian proposals, Methodology and Computing in Applied Probability, 13 (2011), 583-601. doi: 10.1007/s11009-010-9176-9.
[19] R. Neal, Markov chain Monte Carlo methods based on 'slicing' the density function, Technical report, University of Toronto, 1997.
[20] W. Neiswanger, C. Wang and E. Xing, Asymptotically exact, embarrassingly parallel MCMC, arXiv preprint, 2013, arXiv:1311.4780.
[21] P. Peskun, Optimum Monte Carlo sampling using Markov chains, Biometrika, 60 (1973), 607-612. doi: 10.1093/biomet/60.3.607.
[22] M. Plummer, N. Best, K. Cowles and K. Vines, CODA: Convergence diagnosis and output analysis for MCMC, R News, 6 (2006), 7-11.
[23] C. P. Robert, The Bayesian Choice, 2nd ed., Springer-Verlag, New York, 2001.
[24] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed., Springer-Verlag, New York, 2004. doi: 10.1007/978-1-4757-4145-2.
[25] C. P. Robert and D. M. Titterington, Reparameterisation strategies for hidden Markov models and Bayesian approaches to maximum likelihood estimation, Statistics and Computing, 8 (1998), 145-158.
[26] G. O. Roberts, A. Gelman and W. R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, Annals of Applied Probability, 7 (1997), 110-120. doi: 10.1214/aoap/1034625254.
[27] G. O. Roberts and J. S. Rosenthal, Optimal scaling for various Metropolis-Hastings algorithms, Statistical Science, 16 (2001), 351-367. doi: 10.1214/ss/1015346320.
[28] G. O. Roberts and J. S. Rosenthal, Coupling and ergodicity of adaptive MCMC, Journal of Applied Probability, 44 (2007), 458-475. doi: 10.1239/jap/1183667414.
[29] G. O. Roberts and O. Stramer, Langevin diffusions and Metropolis-Hastings algorithms, Methodology and Computing in Applied Probability, 4 (2002), 337-357. doi: 10.1023/A:1023562417138.
[30] G. O. Roberts and R. Tweedie, Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms, Biometrika, 83 (1996), 95-110. doi: 10.1093/biomet/83.1.95.
[31] K. Roeder and L. Wasserman, Practical Bayesian density estimation using mixtures of Normals, Journal of the American Statistical Association, 92 (1997), 894-902. doi: 10.1080/01621459.1997.10474044.
[32] S. Scott, A. Blocker, F. Bonassi, H. Chipman, E. George and R. McCulloch, Bayes and big data: The consensus Monte Carlo algorithm, EFaBBayes 250 Conference, 16 (2013).
[33] C. Sherlock, A. Golightly and D. A. Henderson, Adaptive, delayed-acceptance MCMC for targets with expensive likelihoods, Journal of Computational and Graphical Statistics, 26 (2017), 434-444. doi: 10.1080/10618600.2016.1231064.
[34] C. Sherlock and G. O. Roberts, Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets, Bernoulli, 15 (2009), 774-798. doi: 10.3150/08-BEJ176.
[35] C. Sherlock, A. Thiery and A. Golightly, Efficiency of delayed-acceptance random walk Metropolis algorithms, arXiv preprint, 2015, arXiv:1506.08155.
[36] C. Sherlock, A. Thiery, G. O. Roberts and J. S. Rosenthal, On the efficiency of pseudo-marginal random walk Metropolis algorithms, Annals of Statistics, 43 (2015), 238-275. doi: 10.1214/14-AOS1278.
[37] A. Y. Shestopaloff and R. M. Neal, MCMC for non-linear state space models using ensembles of latent sequences, arXiv preprint, 2013, arXiv:1305.0320.
[38] M. Stephens, Bayesian Methods for Mixtures of Normal Distributions, Ph.D. thesis, University of Oxford, 1997.
[39] I. Strid, Efficient parallelisation of Metropolis-Hastings algorithms using a prefetching approach, Computational Statistics & Data Analysis, 54 (2010), 2814-2835. doi: 10.1016/j.csda.2009.11.019.
[40] L. Tierney, A note on Metropolis-Hastings kernels for general state spaces, Annals of Applied Probability, 8 (1998), 1-9. doi: 10.1214/aoap/1027961031.
[41] L. Tierney and A. Mira, Some adaptive Monte Carlo methods for Bayesian inference, Statistics in Medicine, 18 (1999), 2507-2515.
[42] X. Wang and D. Dunson, Parallel MCMC via Weierstrass sampler, arXiv preprint, 2013, arXiv:1312.4605.
Figure 5. Comparison between geometric MALA (top panels) and geometric MALA with Delayed Acceptance (bottom panels): marginal chains for two arbitrary components (left), estimated marginal posterior density for an arbitrary component (middle), 1D chain trace evaluating mixing (right).