Stein variational gradient descent: Many-particle and long-time asymptotics

*Corresponding author: Nikolas Nüsken

Abstract
Stein variational gradient descent (SVGD) refers to a class of methods for Bayesian inference based on interacting particle systems. In this paper, we consider the originally proposed deterministic dynamics as well as a stochastic variant, each of which represents one of the two main paradigms in Bayesian computational statistics: variational inference and Markov chain Monte Carlo. As it turns out, these are tightly linked through a correspondence between gradient flow structures and large-deviation principles rooted in statistical physics. To expose this relationship, we develop the cotangent space construction for the Stein geometry, prove its basic properties, and determine the large-deviation functional governing the many-particle limit for the empirical measure. Moreover, we identify the Stein-Fisher information (or kernelised Stein discrepancy) as its leading-order contribution in the long-time and many-particle regime in the sense of $ \Gamma $-convergence, shedding some light on the finite-particle properties of SVGD. Finally, we establish a comparison principle between the Stein-Fisher information and RKHS-norms that might be of independent interest.

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.
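The kernelised Stein discrepancy (Stein-Fisher information) mentioned in the abstract admits a closed-form sample estimator that requires only the score $\nabla \log \pi$ of the target, not its normalising constant. As an illustrative sketch (this is the standard V-statistic estimator with a Gaussian RBF kernel, not code from the paper; the kernel and bandwidth choices are assumptions):

```python
import numpy as np

def ksd_vstat(x, score, sigma=1.0):
    """V-statistic estimate of the squared kernelised Stein discrepancy
    between the empirical measure of x (shape (n, d)) and a target with
    score function score(x) = grad log pi(x), using a Gaussian RBF kernel."""
    n, d = x.shape
    s = score(x)                                   # (n, d) score at each sample
    diff = x[:, None, :] - x[None, :, :]           # (n, n, d) pairwise x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))               # kernel matrix
    gKx = -diff / sigma**2 * K[..., None]          # grad w.r.t. first argument
    gKy = diff / sigma**2 * K[..., None]           # grad w.r.t. second argument
    trace = (d / sigma**2 - sq / sigma**4) * K     # trace of mixed second derivative
    u = ((s @ s.T) * K                             # Stein kernel u_pi(x_i, x_j)
         + np.einsum('id,ijd->ij', s, gKy)
         + np.einsum('jd,ijd->ij', s, gKx)
         + trace)
    return u.mean()
```

Because the Stein kernel is positive semi-definite, the estimate is nonnegative, and it shrinks as the empirical measure approaches the target; e.g. for a standard normal target one passes `score = lambda z: -z`.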


Figure 1.  Approximations of a two-dimensional standard normal distribution using deterministic SVGD based on the ODE (4) and two different positive definite kernels $ k_{p, \sigma} $.
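The deterministic SVGD dynamics underlying the figure can be sketched as follows. This is a minimal implementation of the standard SVGD update (Liu and Wang) with a Gaussian RBF kernel, which may differ from the kernels $k_{p,\sigma}$ and the time discretisation of the ODE (4) used for the figure; all function names and parameters here are illustrative:

```python
import numpy as np

def svgd_step(x, score, sigma=1.0, eps=0.1):
    """One deterministic SVGD update for particles x (shape (n, d)):
    a kernel-averaged score term (drift towards the target) plus a
    kernel-gradient term (repulsion between particles)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]        # (n, n, d) pairwise x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))            # Gaussian kernel matrix
    drift = K @ score(x)                        # sum_j k(x_j, x_i) grad log pi(x_j)
    repulsion = np.einsum('ij,ijd->id', K, diff) / sigma**2
    return x + eps * (drift + repulsion) / n

# approximate a 2D standard normal starting from particles far from the mode
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, size=(100, 2))
for _ in range(1000):
    x = svgd_step(x, lambda z: -z)
```

After the iteration, the particle cloud sits near the origin with spread of order one: the drift term pulls particles towards high-density regions while the repulsion term prevents them from collapsing onto the mode.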



