Stein variational gradient descent (SVGD) refers to a class of methods for Bayesian inference based on interacting particle systems. In this paper, we consider the originally proposed deterministic dynamics as well as a stochastic variant, each of which represents one of the two main paradigms in Bayesian computational statistics: variational inference and Markov chain Monte Carlo. As it turns out, these are tightly linked through a correspondence between gradient flow structures and large-deviation principles rooted in statistical physics. To expose this relationship, we develop the cotangent space construction for the Stein geometry, prove its basic properties, and determine the large-deviation functional governing the many-particle limit for the empirical measure. Moreover, we identify the Stein-Fisher information (or kernelised Stein discrepancy) as its leading-order contribution in the long-time and many-particle regime in the sense of $\Gamma$-convergence, shedding light on the finite-particle properties of SVGD. Finally, we establish a comparison principle between the Stein-Fisher information and RKHS-norms that might be of independent interest.
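For readers unfamiliar with the deterministic dynamics discussed above, the following is a minimal sketch (not taken from the paper) of the standard SVGD particle update of Liu and Wang, using an RBF kernel with the median bandwidth heuristic and targeting a two-dimensional standard normal; the function names, step size, and particle count are illustrative choices of our own:

```python
import numpy as np

def svgd_step(X, grad_log_p, eps=0.1):
    """One deterministic SVGD update with an RBF kernel (median heuristic)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]       # diff[j, i] = x_j - x_i
    sq = np.sum(diff ** 2, axis=-1)            # squared pairwise distances
    h = np.median(sq) / np.log(n + 1)          # median bandwidth heuristic
    K = np.exp(-sq / h)                        # K[j, i] = k(x_j, x_i)
    gradK = -2.0 / h * diff * K[:, :, None]    # grad_{x_j} k(x_j, x_i)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K[:, :, None] * grad_log_p(X)[:, None, :] + gradK).mean(axis=0)
    return X + eps * phi

# Target: 2-d standard normal, so grad log p(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) + 2.0             # badly centred initial ensemble
for _ in range(1000):
    X = svgd_step(X, lambda x: -x, eps=0.1)
```

The first term in `phi` is a kernel-smoothed drift towards the target's mode, while the `gradK` term acts as a repulsive force between particles; their balance is what distinguishes the deterministic SVGD flow from independent gradient ascent on $\log p$.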
Figure: Approximations of a two-dimensional standard normal distribution using deterministic SVGD based on the ODE (4) and two different positive definite kernels.