# American Institute of Mathematical Sciences

June  2019, 1(2): 103-128. doi: 10.3934/fods.2019005

## Accelerating Metropolis-Hastings algorithms by Delayed Acceptance

1 Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel St, Bloomsbury, London WC1E 7HT, UK

2 Dipartimento di Economia, Università degli Studi "Gabriele D'Annunzio", Viale Pindaro, 42, 65127 Pescara, Italy

3 School of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, UK

4 Department of Statistics, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK

* Corresponding author: Christian Robert

Published  April 2019

MCMC algorithms such as Metropolis-Hastings algorithms are slowed down by the computation of complex target distributions, as exemplified by huge datasets. We offer a useful generalisation of the Delayed Acceptance approach, devised to reduce such computational costs by a simple and universal divide-and-conquer strategy. The generic acceleration stems from breaking the acceptance step into several parts, aiming at a major gain in computing time that outweighs the corresponding reduction in acceptance probability. Each component is sequentially compared with a uniform variate, the first rejection terminating the iteration. We develop theoretical bounds for the variance of the associated estimators against the standard Metropolis-Hastings algorithm and produce results on optimal scaling and general optimisation of the procedure.
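The factorised acceptance step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `delayed_acceptance_mh`, the Gaussian random-walk proposal and the step size of 2.0 are all hypothetical choices. The target is the normal-normal posterior from the first figure caption ($x = 3$, $\sigma_\mu = 10$), with the likelihood ratio tested first and the prior ratio second.

```python
import math
import random


def delayed_acceptance_mh(log_factors, x0, n_iter, step, seed=0):
    """Random-walk Metropolis-Hastings with a factorised acceptance step.

    log_factors: list of functions whose sum is the log target (up to a constant).
    Returns the chain and the observed acceptance rate.
    """
    rng = random.Random(seed)
    x = x0
    current = [f(x) for f in log_factors]  # cached factor values at the current state
    chain, accepted = [], 0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)  # symmetric proposal, so no proposal ratio
        proposed, ok = [], True
        for k, f in enumerate(log_factors):
            value = f(y)  # later factors are only evaluated if earlier ones pass
            proposed.append(value)
            log_rho = value - current[k]
            # Accept this component with probability min(1, exp(log_rho)).
            if log_rho < 0.0 and rng.random() >= math.exp(log_rho):
                ok = False  # first rejection terminates the iteration
                break
        if ok:
            x, current = y, proposed
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


# Normal-normal example: x | mu ~ N(mu, 1), mu ~ N(0, sigma_mu^2).
x_obs, sigma_mu = 3.0, 10.0


def log_likelihood(mu):
    return -0.5 * (x_obs - mu) ** 2


def log_prior(mu):
    return -0.5 * mu ** 2 / sigma_mu ** 2


chain, rate = delayed_acceptance_mh(
    [log_likelihood, log_prior], x0=0.0, n_iter=50_000, step=2.0
)
sample = chain[1000:]  # discard a short burn-in
post_mean = sum(sample) / len(sample)
post_var = sum((s - post_mean) ** 2 for s in sample) / (len(sample) - 1)
print(f"acceptance rate {rate:.3f}, mean {post_mean:.3f}, variance {post_var:.3f}")
```

The estimated mean and variance can be compared with the exact posterior values $x/\{1+\sigma_\mu^{-2}\}$ and $1/\{1+\sigma_\mu^{-2}\}$. In a realistic setting the cheap factor would be placed first so that most rejections avoid evaluating the expensive one.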

Citation: Marco Banterle, Clara Grazian, Anthony Lee, Christian P. Robert. Accelerating Metropolis-Hastings algorithms by Delayed Acceptance. Foundations of Data Science, 2019, 1 (2) : 103-128. doi: 10.3934/fods.2019005

##### Figures and Tables:
Fit of a two-step Metropolis-Hastings algorithm applied to a normal-normal posterior distribution $\mu|x\sim N(x/\{1+\sigma_\mu^{-2}\}, 1/\{1+\sigma_\mu^{-2}\})$ when $x = 3$ and $\sigma_\mu = 10$, based on $T = 10^5$ iterations, with a first acceptance step considering the likelihood ratio and a second acceptance step considering the prior ratio, resulting in an overall acceptance rate of 12%.

(left) Fit of a multiple-step Metropolis-Hastings algorithm applied to a Beta-binomial posterior distribution $p|x\sim Be(x+a, N+b-x)$ when $N = 100$, $x = 32$, $a = 7.5$ and $b = 0.5$. The binomial $\mathcal{B}(N, p)$ likelihood is replaced with a product of $100$ Bernoulli terms and an acceptance step is considered for the ratio of each term. The histogram is based on $10^5$ iterations, with an overall acceptance rate of 9%; (centre) raw sequence of successive values of $p$ in the Markov chain simulated in the above experiment; (right) autocorrelogram of the above sequence.

Two top panels: behaviour of $\ell^*(\delta)$ and $\alpha^*(\delta)$ as the relative cost varies. Note that for $\delta \gg 1$ the optimal values converge towards the values computed for the standard Metropolis-Hastings algorithm (dashed in red). Two bottom panels: close-up of the interesting region for $0 < \delta < 1$.

Optimal acceptance rate for the DA-MALA algorithm as a function of $\delta$. In red, the optimal acceptance rate for MALA obtained by [27], which is met for $\delta = 1$.

Comparison between geometric MALA (top panels) and geometric MALA with Delayed Acceptance (bottom panels): marginal chains for two arbitrary components (left), estimated marginal posterior density for an arbitrary component (middle), 1D chain trace evaluating mixing (right).

Comparison between MH and MH with Delayed Acceptance on a logistic model. ESS is the effective sample size, ESJD the expected squared jumping distance, time is the computation time.

| Algorithm | rel. ESS (av.) | rel. ESJD (av.) | rel. time (av.) | rel. gain (ESS) (av.) | rel. gain (ESJD) (av.) |
|---|---|---|---|---|---|
| DA-MH over MH | 1.1066 | 12.962 | 0.098 | 5.47 | 56.18 |


Comparison between standard geometric MALA and geometric MALA with Delayed Acceptance, with ESS the effective sample size, ESJD the expected squared jumping distance, time the computation time and $a$ the observed acceptance rate.

| Algorithm | ESS (av.) | ESS (sd) | ESJD (av.) | ESJD (sd) | time (av.) | time (sd) | a (aver.) | ESS/time (aver.) | ESJD/time (aver.) |
|---|---|---|---|---|---|---|---|---|---|
| MALA | 7504.48 | 107.21 | 5244.94 | 983.47 | 176078 | 1562.3 | 0.661 | 0.04 | 0.03 |
| DA-MALA | 6081.02 | 121.42 | 5373.253 | 2148.76 | 17342.91 | 6688.3 | 0.09 | 0.35 | 0.31 |


Comparison using different performance indicators in the example of mixture estimation, based on 100 replicas of the experiments according to model (9) with a sample size $n = 500$, $10^5$ MH simulations and $500$ samples for the prior estimation ("ESS" is the effective sample size, "time" is the computational time). The actual averaged gain ($\frac{\mathrm{ESS}_{DA}/\mathrm{ESS}_{MH}}{\mathrm{time}_{DA}/\mathrm{time}_{MH}}$) is $9.58$, higher than the "double average" that the table suggests as being around $5$.

| Algorithm | ESS (av.) | ESS (sd) | ESJD (av.) | ESJD (sd) | time (av.) | time (sd) |
|---|---|---|---|---|---|---|
| MH | 1575.96 | 245.96 | 0.226 | 0.44 | 513.95 | 57.81 |
| MH + DA | 628.77 | 87.86 | 0.215 | 0.45 | 42.22 | 22.95 |

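
The "double average" mentioned in the mixture-estimation caption can be checked directly from the averaged ESS and time columns. The short sketch below only reproduces that ratio of averages; the variable names are illustrative, and the reported value of $9.58$ is an average of per-replica ratios, which cannot be recomputed from the table alone.

```python
# Averages copied from the mixture-estimation table (ESS and time columns).
ess_mh, time_mh = 1575.96, 513.95
ess_da, time_da = 628.77, 42.22

rel_ess = ess_da / ess_mh      # DA retains roughly 40% of the effective sample size
rel_time = time_da / time_mh   # while using roughly 8% of the computing time
gain_of_averages = rel_ess / rel_time  # ratio of averages, i.e. the "double average"

print(f"rel. ESS {rel_ess:.3f}, rel. time {rel_time:.3f}, gain {gain_of_averages:.2f}")
```

The ratio of averages and the average of ratios differ because the per-replica gain is a nonlinear function of ESS and time, so the two summaries need not agree.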