Discrete gradients for computational Bayesian inference

In this paper, we exploit the gradient flow structure of continuous-time formulations of Bayesian inference in terms of their numerical time-stepping. We focus on two particular examples, namely, the continuous-time ensemble Kalman-Bucy filter and a particle discretisation of the Fokker-Planck equation associated with Brownian dynamics. Both formulations can lead to stiff differential equations which require specially designed methods for their efficient numerical implementation. We compare discrete gradient methods to alternative semi-implicit and other iterative implementations of the underlying Bayesian inference problems.


Introduction
A number of algorithmic approaches to Bayesian inference (Robert 2001, Tarantola 2005, Kaipio & Somersalo 2005) can be rephrased in terms of evolution equations of the form

(1) dz_τ = −A(z_τ) ∇_z V(z_τ) dτ, τ ≥ 0,

with A(z) being symmetric positive semi-definite and V : R^{N_z} → R being an appropriate potential. It holds that

(2) dV(z_τ) = −∇_z V(z_τ) • A(z_τ) ∇_z V(z_τ) dτ ≤ 0

along solutions z_τ ∈ R^{N_z} of (1). ODEs of the form (1) are referred to as gradient dynamics or as being of gradient flow structure (Stuart & Humphries 1996). Generally speaking, numerical approximations of (1) will not preserve the decay property (2) unless the time-step ∆τ > 0 is chosen sufficiently small. Such time-step restrictions can be problematic if the ODE system (1) is stiff (Ascher 2008). Discrete gradient methods (McLachlan et al. 1999) are attractive in this context, as they have been devised such that (2) holds regardless of the chosen step-size ∆τ. However, these methods are implicit, which implies an increased computational cost per time-step. In this paper, we explore the pros and cons of discrete gradient methods in the context of computational Bayesian inference. More specifically, we consider two particular examples of such gradient dynamics arising from particle methods for sequential data assimilation (Law et al. 2015, Reich & Cotter 2015) and from a transformation approach to posterior sampling (Crisan & Xiong 2010, Daum & Huang 2011, Reich 2011). In both cases, the decay property (2) plays a crucial role in ensuring the statistical consistency of the associated inference method when the state variable z_τ is treated as a random variable with the initial z_0 following a given prior distribution π_0.
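The decay property (2) and its loss under too-large explicit steps can be illustrated in a few lines. The following sketch (illustrative only; the quadratic potential and step-size are chosen ad hoc, not taken from the examples below) applies the explicit Euler method to (1) with A = I and checks that V decreases monotonically for a sufficiently small ∆τ:

```python
# Illustrative sketch: explicit Euler for the gradient flow dz/dtau = -grad V(z)
# (i.e., A = I) with the ad-hoc quadratic potential V(z) = 0.5*(z1^2 + 10*z2^2).
# For dtau below the stability limit 2/10 = 0.2 the decay V(z_{n+1}) <= V(z_n) holds.

def V(z):
    return 0.5 * (z[0] ** 2 + 10.0 * z[1] ** 2)

def grad_V(z):
    return [z[0], 10.0 * z[1]]

def explicit_euler(z0, dtau, nsteps):
    z = list(z0)
    values = [V(z)]
    for _ in range(nsteps):
        g = grad_V(z)
        z = [z[k] - dtau * g[k] for k in range(len(z))]
        values.append(V(z))
    return z, values

z_end, vals = explicit_euler([1.0, 1.0], dtau=0.05, nsteps=200)
assert all(vals[n + 1] <= vals[n] for n in range(len(vals) - 1))  # decay property
```

Increasing dtau beyond twice the reciprocal of the largest curvature (here 0.2) destroys the monotone decay, which is precisely the situation discrete gradient methods are designed to avoid.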
The paper is structured as follows. Section 2 provides the necessary background for the two considered continuous-time gradient flow systems, namely, the ensemble Kalman-Bucy filter (EnKBF) and a particle discretisation of the Fokker-Planck equations associated with Brownian dynamics. Section 3 summarises the discrete gradient method and its numerical implementation. We also discuss the semi-implicit Euler method as an alternative time-stepping method. Numerical results and a comparison between different methods are provided in Section 4. The paper closes with a summary in Section 5.

Examples of gradient systems for Bayesian inference
The following two examples of gradient flow systems are both concerned with dynamically transforming a set of Monte Carlo samples {x_0^i}, i = 1, ..., M, from a prior distribution π_0(x) into samples {x_τ^i} from a posterior distribution

(3) π*(x) ∝ π(y|x) π_0(x),

which can be used to approximate expectation values of a function g of a random variable X* with distribution given by (3), that is,

(4) E[g(X*)] ≈ (1/M) Σ_{i=1}^M g(x_τ^i).

The first example, namely, the ensemble Kalman-Bucy filter (Bergemann & Reich 2010a,b, 2012), achieves this for τ = 1, while the second example, based on a particle discretisation of the Fokker-Planck equation (Russo 1990, Degond & Mustieles 1990, Reich 2019), requires τ → ∞. We now describe both approaches in more detail.
2.1. Ensemble Kalman-Bucy filter. We have M particles x_τ^i ∈ R^{N_x} that evolve in time according to the gradient dynamics (Bergemann & Reich 2010a,b)

(5) dx_τ^i/dτ = −P_τ^{xx} ∇_{x^i} V({x_τ^j})

with the symmetric positive semi-definite (covariance) matrix P_τ^{xx} given by

(6) P_τ^{xx} = (1/(M−1)) Σ_{j=1}^M (x_τ^j − x̄_τ)(x_τ^j − x̄_τ)^T, x̄_τ = (1/M) Σ_{j=1}^M x_τ^j,

and the potential V : R^{N_x×M} → R by

(7) V({x^j}) = (1/2) Σ_{j=1}^M S(x^j) + (M/2) S(x̄).

Here S(x) denotes the negative log-likelihood function, that is,

(8) S(x) = −log π(y|x).

We are primarily interested in data models of the form

(9) y = h(x) + Ξ, Ξ ∼ N(0, R),

and, hence,

(10) S(x) = (1/2) (h(x) − y)^T R^{-1} (h(x) − y).

The ODE system (5) is typically solved over a unit time interval, that is, τ ∈ [0, 1]. The initial particles {x_0^i} represent the prior distribution while the particles {x_1^i} at final time approximate the posterior distribution π* and can be used in (4) with τ = 1.
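For a scalar state with linear forward map h(x) = x and noise variance r, the EnKBF drift takes the form −(P/(2r))(x_i + x̄ − 2y), with P the empirical ensemble variance; this concrete drift is an assumption of the sketch below, chosen so that a potential combining per-particle and ensemble-mean log-likelihoods is decreased. The sketch integrates the resulting system with a small explicit Euler step and checks that the ensemble at τ = 1 reproduces the Gaussian posterior; all parameter values are illustrative:

```python
import math

# Sketch (assumption: scalar state, linear forward map h(x) = x, noise variance r).
# Assumed drift: dx_i/dtau = -(P/(2r)) * (x_i + xbar - 2y), with P the empirical
# ensemble variance. Integrated with explicit Euler over tau in [0, 1].

def enkbf(particles, y, r, dtau, nsteps):
    x = list(particles)
    M = len(x)
    for _ in range(nsteps):
        xbar = sum(x) / M
        P = sum((xi - xbar) ** 2 for xi in x) / (M - 1)
        x = [xi - dtau * 0.5 * (P / r) * (xi + xbar - 2.0 * y) for xi in x]
    return x

# Two particles matching prior mean 0.5 and variance 1 (M-1 normalisation).
m0, s0, r, y = 0.5, 1.0, 1.0, 0.1
a = math.sqrt(s0 / 2.0)
x1 = enkbf([m0 - a, m0 + a], y, r, dtau=1e-4, nsteps=10000)
mean = sum(x1) / 2.0
var = sum((xi - mean) ** 2 for xi in x1)  # divided by M - 1 = 1
# Exact Gaussian posterior for these values: mean 0.3, variance 0.5.
assert abs(mean - 0.3) < 1e-2 and abs(var - 0.5) < 1e-2
```

The small step-size is needed here precisely because the flow becomes stiff for small r, which is the regime studied in Section 4.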
The ODE system (5) is of the form (1) with

(11) z_τ = ((x_τ^1)^T, ..., (x_τ^M)^T)^T ∈ R^{N_z}, N_z = M N_x,

and A(z_τ) being a block diagonal matrix with repeated entries P_τ^{xx}. Numerical problems arising from the stiffness of the gradient dynamics have been discussed by Amezcua et al. (2014).
We mention that there are derivative-free extensions of the EnKBF to non-linear forward maps h (Bergemann & Reich 2012, Reich & Cotter 2015). Consider, for example,

(12) dx_τ^i/dτ = −(1/2) P_τ^{xh} R^{-1} (h(x_τ^i) + h̄_τ − 2y),

where P_τ^{xh} denotes the empirical covariance matrix between the particles and their forward map values and h̄_τ the empirical mean of the forward map values. This formulation is no longer of gradient flow structure unless h is linear. There are also stochastic formulations of the EnKBF (Bergemann & Reich 2012, Law et al. 2015, Reich & Cotter 2015). We finally mention that the EnKBF is closely related to iterative implementations of the ensemble Kalman filter as considered, for example, by Emerick & Reynolds (2012), Sakov et al. (2012), Chen & Oliver (2013). Those implementations can also be thought of as particular time-stepping methods for the continuous-time EnKBF. We will return to this aspect in Section 4.1. Iterative implementations of the Kalman-Bucy filter are also related to the natural gradient approach from statistical learning (Ollivier 2018).
2.2. Particle flow Fokker-Planck dynamics. Given M particles x_τ^i and a kernel function ψ(x) ≥ 0 of a Reproducing Kernel Hilbert Space (RKHS) H, we approximate their density in H by

(14) π̃_τ(x) = (1/M) Σ_{j=1}^M ψ(x − x_τ^j),

where π_τ stands for the empirical measure of the particles, and we have assumed that ψ(x) = ψ(−x) as well as

(16) ∫ ψ(x) dx = 1.
Let us denote the standard L²-inner product of two functions f and g by ⟨f, g⟩ and the associated inner product in H by ⟨f, g⟩_H. The Kullback-Leibler divergence between π_τ(x) and the posterior target PDF π*(x) := π(x|y) is then approximated in H, leading to a potential V, and the particle flow Fokker-Planck dynamics is given by

(20) dx_τ^i/dτ = −M ∇_{x_τ^i} V({x_τ^j}).

See Reich (2019) for more details; the gradient of V with respect to x_τ^i is given by (21). We note that one can replace the normalised kernel ψ by an unnormalised version without changing the gradient dynamics. Furthermore, the ODE system (20) is of the form (1) with z_τ defined by (11) and A(z_τ) = M I, where I is the identity matrix.
Let us look at the evolution equations (20) and their geometric properties in some more detail. The variational derivative of the Kullback-Leibler divergence in H with respect to π_τ is given by (22) and, hence, one finds (23). Furthermore, as τ → ∞, the particle flow Fokker-Planck dynamics (20) leads to a stationary particle configuration, and the asymptotic particle positions {x_∞^i} can be used to approximate expectation values with respect to X* ∼ π*, which follows from (23) for an equilibrium density π_∞. We use the simpler, though in the limit M → ∞ less accurate, empirical approximation (4) for τ > 0 sufficiently large in the numerical experiments of Section 4.
For M → ∞ and the kernel function ψ(x) approaching a Dirac delta function, the evolution equations (20) provide a mesh-free particle approximation to the Fokker-Planck equation for the marginal PDFs p_τ of Brownian dynamics

(28) dX_τ = ∇_x log π*(X_τ) dτ + √2 dW_τ

(Reich & Cotter 2015). It holds under suitable conditions on π* that the marginal distributions p_τ of the solutions X_τ to the SDE (28) satisfy

(29) lim_{τ→∞} p_τ = π*

in a weak sense (Pavliotis 2014).
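A minimal sketch of Brownian dynamics itself, assuming the standard overdamped form dX = ∇ log π*(X) dτ + √2 dW (one common convention for (28)), illustrates the convergence (29): Euler-Maruyama trajectories with target π* = N(0, 1) produce samples whose first two moments approach those of the target.

```python
import math, random

# Sketch (assumed SDE convention): Euler-Maruyama for dX = grad log pi*(X) dtau
# + sqrt(2) dW with target pi* = N(0, 1), so grad log pi*(x) = -x.

def euler_maruyama(x0, dtau, nsteps, rng):
    x = x0
    for _ in range(nsteps):
        x += -x * dtau + math.sqrt(2.0 * dtau) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(42)
samples = [euler_maruyama(0.0, dtau=0.01, nsteps=1000, rng=rng) for _ in range(1000)]
m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / (len(samples) - 1)
assert abs(m) < 0.15 and abs(v - 1.0) < 0.2  # moments of N(0,1), loose tolerance
```

Note that, in contrast to the deterministic particle flow (20), this stochastic sampler requires many independent trajectories (or a long trajectory) to estimate expectations.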
As a specific example, let us consider the RKHS H with Gaussian kernel

(30) ψ(x) = n(x; 0, B)

for given covariance matrix B. Here n(x; x̄, B) denotes the probability density function of a Gaussian random variable with mean x̄ and covariance matrix B. We now provide a construction for the required covariance matrix B and the initial particle positions x_0^i that yields a kernel approximation to the given prior distribution π_0. Given a set of M samples x_0^i from the prior distribution π_0, the associated empirical measure is given by (31) with empirical mean x̄_0 and empirical covariance matrix P_0^{xx}. Upon choosing a parameter α ∈ (0, 1] and setting

(32) B = α P_0^{xx}

in (30), the associated measure (14) in H is defined as follows:

(33) π̃_0(x) = (1/M) Σ_{i=1}^M n(x; x̃_0^i, B)

with particle positions

(34) x̃_0^i = x̄_0 + √(1−α) (x_0^i − x̄_0).

In other words, the covariance matrix B and the particle positions x̃_0^i are chosen such that the mean and covariance matrix under π̃_0 are identical to x̄_0 and P_0^{xx}, respectively, that is, (35) and (36). The initial particle positions {x̃_0^i} are now evolved under the ODEs (20). Note that α = 1 leads to x̃_0^i = x̄_0 and that points with ∇_{x^i} V = 0, that is, stationary points of (20), agree with critical points of the posterior distribution π*. It is possible to extend the dynamics in the particles {x_τ^i} by an evolution equation in the kernel matrix B (Yserentant 1997). Furthermore, one can put (20) into the more general form of (1), that is,

(37) dx_τ^i/dτ = −A({x_τ^j}) ∇_{x_τ^i} V({x_τ^j})

for i = 1, ..., M. A particular choice is given by A({x_τ^j}) = M P_τ^{xx}. A related preconditioned Brownian dynamics formulation (28) has been proposed recently by Garbuno-Inigo et al. (2019).
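The moment-matching property of the construction (32)-(34) can be checked numerically. The sketch below treats the scalar case and uses the assumed shrinkage form x̃_i = x̄ + √(1−α)(x_i − x̄) together with B = αP_0; it verifies that the resulting Gaussian mixture retains the empirical mean and variance for any α ∈ (0, 1]:

```python
import math, random

# Sketch of the moment-matching construction (assumed form): kernel variance
# B = alpha * P0 and shifted centres xt_i = xbar + sqrt(1 - alpha)*(x_i - xbar).
# The mixture (1/M) sum_i n(x; xt_i, B) then has mean xbar and variance
# B + (1 - alpha)*P0 = P0 (scalar case shown).

def moment_matched_mixture(samples, alpha):
    M = len(samples)
    xbar = sum(samples) / M
    P0 = sum((s - xbar) ** 2 for s in samples) / (M - 1)
    B = alpha * P0
    centres = [xbar + math.sqrt(1.0 - alpha) * (s - xbar) for s in samples]
    return centres, B

rng = random.Random(0)
samples = [rng.gauss(2.0, 1.5) for _ in range(50)]
centres, B = moment_matched_mixture(samples, alpha=0.3)
M = len(centres)
mix_mean = sum(centres) / M
# mixture variance = kernel variance + variance of the centres
mix_var = B + sum((c - mix_mean) ** 2 for c in centres) / (M - 1)
xbar = sum(samples) / M
P0 = sum((s - xbar) ** 2 for s in samples) / (M - 1)
assert abs(mix_mean - xbar) < 1e-9 and abs(mix_var - P0) < 1e-9
```

The identity αP0 + (1−α)P0 = P0 makes the check exact up to floating-point rounding, independently of the sample set.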
We finally mention that the Stein variational approach (Liu & Wang 2016) has recently become popular as an alternative dynamical approach for transforming a set of Monte Carlo samples from a prior distribution into samples from a posterior distribution. However, it is currently unknown whether the resulting interacting particle systems possess a gradient flow structure. See Detommaso et al. (2018) for a preconditioned Stein variational formulation in the spirit of (37).

Discrete gradient time-stepping
Discrete gradient methods are numerical time-stepping methods for ODEs of type (1) that allow one to maintain the decay property (2) under arbitrary choices of the step-size ∆τ > 0. See McLachlan et al. (1999) for an overview of such methods. Here we focus on the following particular form of discrete gradient dynamics:

(38) z_{n+1} = z_n − ∆τ A(z_{n+θ}) ∇̄V(z_n, z_{n+1}), z_{n+θ} = (1−θ) z_n + θ z_{n+1},

with a discrete gradient ∇̄V, that is, a continuous function satisfying

(39) ∇̄V(z, z') • (z' − z) = V(z') − V(z),

and symmetric positive definite matrix A(z). It follows that

(40) V(z_{n+1}) − V(z_n) = −∆τ ∇̄V(z_n, z_{n+1}) • A(z_{n+θ}) ∇̄V(z_n, z_{n+1}) ≤ 0,

as desired. Note that (40) is a rather natural discrete representation of the continuous (2).
There exist several variants of the method, for example, the time-symmetric formulation of Gonzalez (1996) (that is, with θ = 1/2) and the local update version of Reich (1996).
We rewrite (38) in the form

(41) z_{n+1} = z_n − γ_n ∆τ A(z_{n+θ}) ∇_z V(z_{n+θ})

with the scalar factor γ_n defined by (42). Hence one can think of the discrete gradient method (38) as a θ-method (Stuart & Humphries 1996) with an implicitly determined effective step-size γ_n ∆τ. Formulation (41) with θ > 0 leads naturally to the following iterative solution procedure at each time-step τ_n:

• Set l = 0, z_{n+θ}^l = z_n, and γ_n^l = 1.
• For l ≥ 0 solve, until convergence, the minimisation problem (44) and set (45).

The minimisation problem (44) can be solved by the Gauss-Newton method (Nocedal & Wright 2006) when applied to the EnKBF with data model (9). Hence only first-order derivatives of the nonlinear forward operator h are required. Alternatively, one can use a quasi-Newton method (Nocedal & Wright 2006) directly on the nonlinear equations arising from the chosen time-stepping method. In case A(z) is only positive semi-definite, the inverse in (44) should be replaced by the pseudo-inverse and the update is performed in the subspace spanned by the image of A(z_{n+θ}^l). In the subsequent numerical examples, we will compare the discrete gradient formulation (38) to the standard semi-implicit Euler method

(47) z_{n+1} = z_n − ∆τ A(z_n) ∇_z V(z_{n+1}),

which can also be reformulated as a minimisation problem (48). We note that (48) is structurally similar to the minimisation problem (44). However, the semi-implicit Euler method (47) requires the solution of such a minimisation problem only once per time-step, while the discrete gradient method leads to a more elaborate implementation. Furthermore, iteration (47) does not obey a discrete decay property of the form (40).
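The unconditional decay of the discrete gradient method, in contrast to the conditional stability of explicit time-stepping, is easily demonstrated for a scalar quadratic potential, for which the Gonzalez midpoint discrete gradient admits a closed-form update. This is a sketch under those simplifying assumptions, not the implementation used in the experiments:

```python
# Sketch (scalar, V(z) = lam*z^2/2, A = 1): for a quadratic potential the
# Gonzalez midpoint discrete gradient is lam*(z_n + z_{n+1})/2, and the
# implicit update solves in closed form to
#   z_{n+1} = z_n * (1 - lam*dtau/2) / (1 + lam*dtau/2).
# The amplification factor has modulus < 1 for every dtau > 0, so V decays
# unconditionally, while explicit Euler destabilises once dtau > 2/lam.

lam = 100.0   # stiff mode
dtau = 1.0    # far beyond the explicit stability limit 2/lam = 0.02

def V(z):
    return 0.5 * lam * z * z

z_dg, z_ee = 1.0, 1.0
dg_decays = True
for _ in range(20):
    z_new = z_dg * (1.0 - 0.5 * lam * dtau) / (1.0 + 0.5 * lam * dtau)
    dg_decays = dg_decays and (V(z_new) <= V(z_dg))
    z_dg = z_new
    z_ee = z_ee * (1.0 - lam * dtau)  # explicit Euler amplification

assert dg_decays                 # discrete decay property (40) holds
assert abs(z_ee) > 1e10          # explicit Euler blows up
```

For nonlinear potentials the closed form is lost and the implicit equation must be solved iteratively, which is exactly the extra cost discussed above.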
While the discrete decay property (40) is sufficient to reach a critical point of (20) as τ → ∞, a sufficiently accurate approximation of the ensemble Kalman-Bucy filter equations is required in order to reproduce the posterior distribution. This statement is confirmed by the numerical experiments in the following section. We note in this context that the discrete gradient method (38) is first-order accurate for all θ ∈ (0, 1], except for θ = 1/2, for which the method becomes second-order accurate.

Numerical results
We study the behaviour of the proposed gradient flow systems and their discretisations for a linear and a nonlinear one-dimensional inference problem, as well as for a partially observed nonlinear three-dimensional system. We start with the ensemble Kalman-Bucy filter formulation.

4.1.1. Linear problem. The experimental parameters are set to m_0 = 1/2, σ_0 = 1, r = 0.02, y = 0.1. We implement the EnKBF formulation with M = 2 ensemble members. Their initial positions x_0^i are uniquely determined by the requirement (52) that the ensemble mean and the empirical variance reproduce m_0 and σ_0, respectively. Furthermore, the analytic solutions x_τ^i, τ ∈ (0, 1], can be recovered from the associated solutions (53) for the mean and variance with time-dependent Kalman gain (54). The potential (7) reduces to (55) with gradients (56), and the matrix A(x^1, x^2) is given by (57). We implement the discrete gradient method (38) with θ = 1 and the semi-implicit Euler method (47) for ∆τ ∈ {0.1, 0.2, 0.5, 1}. The relatively small value of r renders (5) stiff, and the explicit Euler method is unstable for any of the four chosen step-sizes ∆τ. The numerical results are compared to the analytic solutions as given by (53) and can be found in Figure 1. We find that the discrete gradient method systematically overestimates the variance while the semi-implicit Euler method has the opposite effect on the variance. The semi-implicit Euler method also performs remarkably well with respect to the mean, while the discrete gradient method leads to relatively large errors in the final mean for step-sizes ∆τ = 0.5 and ∆τ = 1.0.
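The analytic mean and variance of the scalar linear Kalman-Bucy flow can be checked against Bayes' rule directly. The closed forms used below are assumptions of the sketch (consistent with the variance ODE dσ/dτ = −σ²/r), not copied from (53)-(54):

```python
# Sketch: assumed closed-form solution of the scalar linear Kalman-Bucy flow,
#   sigma_tau = sigma0 * r / (r + tau * sigma0),
#   m_tau     = m0 + tau * K_tau * (y - m0),  K_tau = sigma0 / (r + tau * sigma0).
# At tau = 1 these reproduce the Gaussian posterior mean and variance.

m0, sigma0, r, y = 0.5, 1.0, 0.02, 0.1  # parameters of the linear experiment

def mean_var(tau):
    K = sigma0 / (r + tau * sigma0)
    return m0 + tau * K * (y - m0), sigma0 * r / (r + tau * sigma0)

m1, s1 = mean_var(1.0)
# Bayes: posterior precision 1/sigma0 + 1/r, mean (m0/sigma0 + y/r)/precision.
prec = 1.0 / sigma0 + 1.0 / r
assert abs(s1 - 1.0 / prec) < 1e-12
assert abs(m1 - (m0 / sigma0 + y / r) / prec) < 1e-12
```

Since σ_τ decreases from σ_0 to roughly r for τ ∈ [0, 1], the flow contracts by a factor of about σ_0/r = 50 here, which is the source of the stiffness noted above.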
4.1.2. Nonlinear problem. We consider a nonlinear forward map h and a measurement error Ξ ∼ N(0, r) with r = 1. The observed value is y = 2 and the prior distribution is Gaussian with X ∼ N(−2, 1/2). The posterior mean and variance are given by x̄* ≈ 0.2095 and σ* ≈ 0.0211, respectively. See Example 5.9 in Reich & Cotter (2015) for more details, including the definition of h. Given M particles x_n^i and a state vector z_n = (x_n^1, ..., x_n^M)^T, the required solution z_{n+θ}^{l+1} ∈ R^M of the minimisation problem (44) can be recast in the form (60) of minimising ||f(z)||² in a weighted norm (62), where f : R^M → R^{2M+1} is defined by (61). The Gauss-Newton method replaces the nonlinear minimisation problem (60) by the linearised problem (63). Here the Jacobian matrix Df(z) ∈ R^{(2M+1)×M} is provided by (64), and D(c) ∈ R^{M×M} is a diagonal matrix with the entries of c on its diagonal. The semi-implicit Euler method (47) leads to a rather similar Gauss-Newton iteration (66). We compare the results from the semi-implicit Euler and discrete gradient method with θ = 1 to the gradient-free, explicit time-stepping method (IEnKF) (68), which has been discussed, for example, by Amezcua et al. (2014), Blömker et al. (2018), de Wiljes et al. (2018). The discretisation (68) is also related to the DEnKF formulation of the ensemble Kalman filter as proposed by Sakov & Oke (2008).
We used M = 100 particles in our numerical experiments and step-sizes of ∆τ ∈ {0.01, 0.1, 0.2, 0.5}. An explicit Euler discretisation is stable only for the smallest of these step-sizes and is used to provide the reference solution. See Figure 2 for details.
We find that both the semi-implicit Euler and the discrete gradient method approximate the exact time-evolution of the gradient flow system, as represented by the explicit Euler method, rather well for step-sizes ∆τ ∈ {0.01, 0.1}. The IEnKF method leads to systematically different results in the variance which, however, correspond rather well to the exact posterior value for this particular problem. We also note that the IEnKF method performs well for the larger step-size of ∆τ = 0.2. Overall, the IEnKF method (68) emerges as the computationally cheapest and most accurate method among the four tested time-stepping procedures. (The reference solution displayed in Figure 2 is given by the explicit Euler method with step-size ∆τ = 0.00025.)

4.2. Particle flow Fokker-Planck dynamics. We repeat the experiments from the previous subsection using the particle flow Fokker-Planck dynamics. While its implementation is more involved, the particle flow Fokker-Planck dynamics has the advantage of being more generally applicable than the Kalman-based formulations. We also implement the particle flow Fokker-Planck dynamics for a sequential data assimilation procedure for the Lorenz-63 model (Lorenz 1963).
4.2.1. Linear problem. We use exactly the same experimental set-up as already used for the EnKBF, that is, the measurement error is Gaussian with variance r = 0.02 and the prior is Gaussian with mean m_0 = 1/2 and variance σ_0 = 1. We now use M = 10 particles to approximate the (Gaussian) posterior distribution with a Gaussian mixture (33). The parameter α in (32) is set to α = 0.005. The particle system reaches a stationary configuration fairly quickly, which approximates the first two moments of the posterior distribution rather well. See Figure 3. We also find that the discrete gradient method with θ = 1 relaxes to the stationary configuration within a single time-step even for large step-sizes ∆τ. The semi-implicit Euler method also performs rather well.
4.2.2. Nonlinear problem. We now come to a more challenging example with a non-Gaussian posterior distribution and consider the same nonlinear forward map and experimental set-up as described in Section 4.1.2. Again, we use M = 100 particles and consider the discrete gradient method with θ = 1 and the semi-implicit Euler method. The reference solution is again provided by the explicit Euler method with small step-size ∆τ = 2.5 × 10^{−6}. Here we consider step-sizes of ∆τ ∈ {0.002, 0.005, 0.02, 0.05}; see Figure 4 for the results. The minimisation problem (44) is solved using the fminunc Matlab routine, which is based on a quasi-Newton method. The parameter α in (32) is set to α = 0.01. As for the linear example, we find that the discrete gradient method with θ = 1 converges rapidly to the stationary configuration, with both the mean and the variance of the posterior distribution well approximated. The semi-implicit method also behaves well, with only a slightly slower convergence rate at large step-sizes.
Remark 4.1. One could also implement the preconditioned formulation (37) with A({x_n^j}) being the empirical covariance matrix of the particle system times the ensemble size. This can imply significant computational savings for higher-dimensional problems if the kernel matrix B in (30) is also provided by an empirical covariance matrix. Furthermore, the explicit computation of gradients can be avoided, similarly to what is done in the IEnKF implementation (68), by considering a negative log-likelihood function of the form (10) and approximating the associated gradients in terms of ensemble differences. See also Garbuno-Inigo et al. (2019) for related ideas in the context of Brownian dynamics. However, the gradient flow structure is lost under this approximation.

4.2.3. Lorenz-63. We use the experimental setting of Chustagulprom et al. (2016). Specifically, the Lorenz-63 differential equations (Lorenz 1963) with the standard parameters σ = 10, ρ = 28, and β = 8/3 are solved numerically with the implicit midpoint rule and step-size ∆t = 0.01. We only observe the first component of the state vector in observation intervals of ∆t_obs = 0.12 with observation error variance R = 8. A total of K = 200,000 assimilation cycles are performed. The ensemble size varies between M = 15 and M = 50 and particle rejuvenation is applied with parameter β = 0.2 (Chustagulprom et al.
2016). The quality of the data assimilation method is assessed using the root mean squared error (RMSE) with N_x = 3 and t_k = 0.12k. Here x_ref(t_k) ∈ R^3 denotes the reference solution and x̄_a(t_k) the analysis mean at observation time t_k. At any given time in the sequential data assimilation procedure, a forecast ensemble {x_f^i} is generated by propagating the posterior sample (also referred to as analysis ensemble) {x_a^i} from the previous time step through the discretised Lorenz-63 equations. The transformation of this forecast ensemble into the posterior for the next time step via Gaussian mixture approximations and the particle flow Fokker-Planck dynamics is conducted as follows. One first computes the empirical mean x̄_f and the empirical covariance matrix P_f from the forecast ensemble. Given a parameter value α ∈ (0, 1], a Gaussian mixture approximation to the forecast distribution is provided by (72), as already discussed in Section 2.2. Given an observation model of the form (9) with linear forward map h(x) = Hx, the analysis (or posterior) distribution is also a Gaussian mixture (73). We next use the particle flow Fokker-Planck dynamics (20) with π* = π_a, kernel functions ψ(x) = n(x; 0, B_a) and initial particle positions x_0^i = x_a^i in order to find an equally weighted Gaussian mixture approximation to (73). Let us denote the resulting ensemble at a suitably chosen final time τ_end by {x_*^i}. The analysis ensemble is then finally given by (77). Note that α = 1 leads back to the standard ensemble Kalman filter (Evensen 2006, Reich & Cotter 2015), while one formally approaches a standard particle filter for α → 0.

Table 1. RMSE for sequential data assimilation applied to the Lorenz-63 model. The parameter α = 1 corresponds to a standard square root ensemble Kalman filter, while α < 1 results in a Gaussian mixture approximation to the prior and posterior distributions. Small improvements over the ensemble Kalman filter can be found for ensemble sizes M ≥ 35.
However, it should be noted that the particle flow Fokker-Planck dynamics is not well-defined in the limit α → 0 for fixed ensemble size M.
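The per-component Kalman analysis step underlying the Gaussian mixture update can be sketched in the scalar case (H = 1; all numerical values are illustrative, not from the Lorenz-63 experiment): each mixture mean is shifted by the shared Kalman gain and the component covariance shrinks accordingly.

```python
# Sketch (scalar state, H = 1): per-component Kalman update for a Gaussian
# mixture analysis step with shared gain K = B_f/(B_f + R), shifted means
# m_a^i = m_f^i - K*(m_f^i - y), and covariance B_a = B_f - K*B_f.

def mixture_kalman_update(means_f, B_f, R, y):
    K = B_f / (B_f + R)
    means_a = [m - K * (m - y) for m in means_f]
    B_a = B_f - K * B_f
    return means_a, B_a

means_a, B_a = mixture_kalman_update([0.0, 1.0, 2.0], B_f=2.0, R=8.0, y=4.0)
# K = 0.2: each mean moves 20% of the way toward y; B_a = 2.0 - 0.2*2.0 = 1.6.
assert abs(B_a - 1.6) < 1e-12
assert abs(means_a[0] - 0.8) < 1e-12
```

In the full scheme the analysis mixture is additionally reweighted by the component likelihoods; the sketch omits the weights and shows only the mean/covariance update.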
The RMSEs for ensemble sizes M ∈ {15, ..., 50} and parameters α ∈ {0.8, ..., 1} in (72) can be found in Table 1. One finds that α < 1 leads to some improvements, especially for larger ensemble sizes. Overall, however, these improvements are not as impressive as those obtained by the hybrid methods considered by Chustagulprom et al. (2016).

Summary
We have discussed two instances of gradient dynamics arising from Bayesian inference. The gradient flow structure lends itself naturally to discrete gradient time-stepping methods. However, such methods are implicit and potentially costly to implement. We found that discrete gradient time-stepping does not improve the behaviour of iterated ensemble Kalman filters such as the derivative-free formulation (68). The situation is different for the particle flow Fokker-Planck dynamics, where the discrete gradient method was shown to converge rapidly. This difference in behaviour can be explained by the fact that the ensemble Kalman-Bucy filter requires an accurate representation of its transient dynamics, while the Fokker-Planck dynamics only requires a rapid transition to an equilibrium solution. Applications of the particle flow Fokker-Planck dynamics to a Lorenz-63 sequential data assimilation problem showed small improvements over a standard ensemble Kalman filter. Further applications of these methods may arise in the context of machine learning. See, for example, Ollivier (2018), Kovachki & Stuart (2018).

Figure 3. Exact final values and numerical approximations to the mean m_t and variance σ_t for the linear problem. The posterior is estimated by the Fokker-Planck dynamics with M = 10 particles using the discrete gradient (DG) method with θ = 1 and the semi-implicit (SI) Euler method with step-sizes (a) ∆τ = 0.004, (b) ∆τ = 0.01, (c) ∆τ = 0.04, (d) ∆τ = 0.1. The reference time evolution of the mean and variance is given by the explicit Euler (EE) method with step-size ∆τ = 2 × 10^{−4}.

Figure 4. Exact final values and numerical approximations to the mean m_t and variance σ_t for the non-linear problem. The posterior is estimated by the Fokker-Planck dynamics using the discrete gradient (DG) method with θ = 1 and the semi-implicit (SI) Euler method with step-sizes (a) ∆τ = 0.002, (b) ∆τ = 0.005, (c) ∆τ = 0.02, (d) ∆τ = 0.05. The reference time evolution of the mean and variance is given by the explicit Euler (EE) method with step-size ∆τ = 2.5 × 10^{−6}.
x_a^i = x_f^i − K(Hx_f^i − y), with Kalman gain K = B_f H^T (HB_f H^T + R)^{-1}, component weights proportional to exp(−(1/2)(Hx_f^i − y)^T (HB_f H^T + R)^{-1} (Hx_f^i − y)), and new covariance matrix (76) B_a = B_f − KHB_f.