Stein variational gradient descent (SVGD) refers to a class of methods for Bayesian inference based on interacting particle systems. In this paper, we consider the originally proposed deterministic dynamics as well as a stochastic variant, each of which represents one of the two main paradigms in Bayesian computational statistics: variational inference and Markov chain Monte Carlo. As it turns out, these are tightly linked through a correspondence between gradient flow structures and large-deviation principles rooted in statistical physics. To expose this relationship, we develop the cotangent space construction for the Stein geometry, prove its basic properties, and determine the large-deviation functional governing the many-particle limit for the empirical measure. Moreover, we identify the Stein-Fisher information (or kernelised Stein discrepancy) as its leading-order contribution in the long-time and many-particle regime in the sense of $\Gamma$-convergence, shedding some light on the finite-particle properties of SVGD. Finally, we establish a comparison principle between the Stein-Fisher information and RKHS norms that might be of independent interest.
Approximations of a two-dimensional standard normal distribution using deterministic SVGD based on the ODE (4) and two different positive definite kernels
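The kind of experiment described in this caption can be sketched in a few lines. The block below is a minimal illustration, not the authors' code: it assumes that the ODE in question is the standard deterministic SVGD particle system of Liu and Wang, discretised with explicit Euler steps, and it uses a single Gaussian kernel as one possible positive definite kernel. The function names, the bandwidth, the step size, the particle count and the initialisation are all hypothetical choices made for illustration.

```python
import math
import random


def svgd_step(particles, step_size, bandwidth):
    """One explicit Euler step of deterministic SVGD targeting N(0, I).

    Assumptions (not taken from the paper): Gaussian kernel
    k(x, y) = exp(-|x - y|^2 / (2 h^2)) and potential V(x) = |x|^2 / 2,
    so that grad log pi(x) = -x.
    """
    n = len(particles)
    dim = len(particles[0])
    updated = []
    for xi in particles:
        drift = [0.0] * dim
        for xj in particles:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            k = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
            for d in range(dim):
                # Attraction: k(x_i, x_j) * grad log pi(x_j) = -k * x_j.
                attraction = -k * xj[d]
                # Repulsion: gradient of k(x_i, x_j) with respect to x_j,
                # which equals k * (x_i - x_j) / h^2 for the Gaussian kernel.
                repulsion = k * (xi[d] - xj[d]) / bandwidth ** 2
                drift[d] += (attraction + repulsion) / n
        updated.append([x + step_size * v for x, v in zip(xi, drift)])
    return updated


def run_svgd(n_particles=20, n_steps=300, step_size=0.1, bandwidth=1.0, seed=0):
    """Evolve particles from a deliberately poor initialisation."""
    rng = random.Random(seed)
    particles = [[rng.uniform(3.0, 4.0), rng.uniform(-4.0, -3.0)]
                 for _ in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles, step_size, bandwidth)
    return particles


def empirical_mean(particles):
    """Coordinate-wise mean of the particle cloud."""
    n = len(particles)
    return [sum(p[d] for p in particles) / n for d in range(len(particles[0]))]


def mean_squared_norm(particles):
    """Average of |x|^2 over the particles (equal to 2 under the 2D target)."""
    return sum(sum(c * c for c in p) for p in particles) / len(particles)


if __name__ == "__main__":
    final = run_svgd()
    print("empirical mean:", empirical_mean(final))
    print("mean squared norm:", mean_squared_norm(final))
```

The attraction term drives particles towards regions of high target density, while the kernel gradient term keeps them apart. With finitely many particles the empirical second moment typically falls somewhat short of the target value, which is the kind of finite-particle effect the abstract refers to. Swapping in a different kernel only requires changing the expressions for `k` and `repulsion`.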