Ensemble Kalman Inversion for nonlinear problems: weights, consistency, and variance bounds

Ensemble Kalman Inversion (EnKI), originally derived from Enseble Kalman Filter, is a popular sampling method for obtaining a target posterior distribution. It is, however, inconsistent when the forward map is nonlinear. Important Sampling (IS), on the other hand, ensures consistency at the expense of large variance of weights, leading to slow convergence of high moments. We propose a WEnKI, a weighted version of EnKI in this paper. It follows the same gradient flow as that of EnKI with a weight correction. Compared to EnKI, the new method is consistent, and compared with IS, the method has bounded weight variance. Both properties will be proved rigorously in this paper. We further discuss the stability of the underlying Fokker-Planck equation. This partially explains why EnKI, despite being inconsistent, performs well occasionally in nonlinear settings. Numerical evidence will be demonstrated at the end.


Introduction
How to sample from an intractable distribution is a classical challenge emerging from Bayesian statistics, machine learning, computational physics, among many other areas. Denote G : X → Y a forward map between separable Hilbert spaces X and Y. While the forward problem amounts to finding G(u) for every u ∈ X , the inverse problem amounts to reconstructing the unknown parameters u from the observation y. Sampling provides a probability perspective for such reconstruction procedure. Throughout the paper we set X = R L and Y = R K .
Let y be the collected data. It is generated from the forward map G acting on u with added Gaussian noise η that is assumed to be independent of u: y = G(u) + η , with η ∼ N (0, Γ) .
Throughout, we assume G is sufficiently smooth and its gradient is denoted by To find u using y, a typical approach is to perform minimization. We denote the least-squares functional Φ(·; y) : X → R by Φ(u; y) = 1 2 |y − G(u)| 2 Γ = 1 2 (y − G(u)) Γ −1 (y − G(u)) , then the optimal solution u * is simply the parameter that minimizes the mismatch: This approach however is unable to characterize the uncertainty of the estimation. In the Bayesian formulation, one takes a probability point of view, and regards u as a random variable. The aim is to reconstruct the probability distribution of u that combines the prior knowledge and the information from the collected data y. More explicitly, let ρ prior (u) be the prior distribution, then the posterior distribution of u, denoted by ρ pos , includes the prior distribution, modified by the likelihood function: ρ pos (u) = 1 Z exp (−Φ(u; y)) ρ prior (u) . The normalization constant Z is given by: Z := X exp (−Φ(u; y)) ρ prior (u) du , so that ρ pos (u) du = 1 .
This perspective provides the full landscape of u. While it provides more information, the computational cost is certainly more demanding. Sampling is one problem emerging under this framework: how to design a cheap numerical solver that generates (hopefully i.i.d.) samples from the target distribution (2)? In particular, suppose one can sample N particles in {u n } N n=1 ∈ X , and each particle is associated with a weight w n , then how to design the values for (u n , w n ) so that, in some sense N n=1 w n δ u n ≈ ρ pos ?
Many sampling algorithms have been proposed in literature, ranging from classical techniques such as Markov chain Monte Carlo to strategies based on interacting particles. Some set w n = 1 N for all n, while others use u n -dependent weights w n . We will explore the latter in this work.
There are two general guiding principles for designing of sampling algorithms: consistency and small variance.
• Consistency guarantees that the ensemble distribution is "equivalent" to the target posterior distribution, in the average sense: When tested on all smooth functions f , we require Here the E sign on the left hand side means taking expectation of all sampling configurations. Denote then we say µ is consistent with ρ pos , or µ ∼ ρ pos , if (4) holds true.
• Variance of the weights gives an indicator of the performance of the sampling algorithm, it measures how close each configuration of (5), from one run of the algorithm, is to the true, i.e. we would like an algorithm so that E N n=1 ω n f (u n ) − E ρpos (f ) 2 is small. (6) Once again the E sign takes expectation over all possible configurations from the sampling algorithm. For a bounded test function f , if {(ω n , u n )} N n=1 are i.i.d., then: where we use i.i.d. in the second equality and E(N ω 1 ) = 1 by consistency. This means the variance of the weight, Var(N ω 1 ), serves as a measure of the performance. If small, (6) holds true, and the algorithm is regarded as a good one. We note that some sampling algorithms cannot provide i.i.d. {w n , u n } pairs, making the inequality (7) not exactly true. It still provides a good estimate in the mean-field regime when N 1. There have been many successful algorithms developed in literature that aim at achieving these two properties. Our algorithms are built upon ideas from some of these methods, including "Importance Sampling" (IS) and "Ensemble Kalman Inversion/Square Root Filter" (EnKI/EnSRF), all three of which will be briefly recalled below and reviewed in more details in Section 2.
Importance Sampling is a rather standard technique: it involves assigning weights to particles so that an easy-to-be-sampled distribution can be turned into the target distribution. The weight is simply the ratio of the two. Regarding the two guiding principles, IS always achieves consistency, but it may give rise to high variance, especially when the easy-to-be-sampled and the target distribution are very different. Some approaches have been proposed to incorporate "re-sampling" to reduce the variance, such as the strategies used in [20,18]. We do not discuss the details.
The other end of the spectrum is EnKI and EnSRF, two algorithms that require the motion of the particles. The idea of these algorithms have its origin to the Kalman filter. They can be seen as the "analysis step" of data assimilation. Roughly speaking, the samples are generated from an easy-to-be-sampled distribution, and some dynamics is injected to move the samples around so that after finite time (usually 1) they look like i.i.d. samples from the posterior distribution. There are no weights involved at all, and each particles takes w n = 1 N , so the variance is always 0. However, these algorithms also inherit the disadvantages from Ensemble Kalman Filter, and highly rely on the Gaussianity assumption, that furthermore requires linearity of the forward map -for nonlinear forward map, the consistency is sacrificed.
Our goal in this work is to design algorithms by combining advantages of IS and EnKI/EnSRF. We rely on the introduction of the weights to achieve consistency, and the motion introduced in EnKI/EnSRF helps reducing the variance. In this way, we propose the Weighted-Ensemble-Kalman-Inversion (WEnKI) and Weighted-Ensemble-Square-Root-Filter (WEnSRF) as weighted versions of the EnKI and EnSRF that achieve consistency for general nonlinear forward maps. We also establish theoretical bounds of the weight variance for the proposed methods. In some sense, this work can be viewed as a correction to EnKI/EnSRF to ensure consistency and an improvement over IS in terms of reducing the weight variance. A natural question then is: how much improvement do we get? As a comparison to EnKI/EnSRF, this amounts to analyzing the strength of the weight term. This is a side product of the paper: by quantifying the deviation between EnKI and WEnKI using the weight term, we give an estimate of the error for EnKI when the forward map in nonlinear.
The rest of this paper is organized in the following: in Section 2, we give a brief review of the above mentioned three methods, Important Sampling, Ensemble Kalman Inversion, and Ensemble Square Root Filter. In Section 3 we propose our correction to EnKI and EnSRF with added weights. Proof of consistency and some discussion about the control of the variance of weights are presented in Section 4. In Section 5 we demonstrate numerical evidence. Some concluding remarks are presented at the end of the paper.

Importance Sampling and ensemble Kalman filter
We review a few sampling strategies in this section. In particular, the Importance Sampling that involves adding weights to the particles to achieve consistency, Ensemble Kalman Inversion and Ensemble Square Root Filter that involve adding motions to the particles so that samples are moved to represent the support of the target.
2.1. Importance sampling. The first sampling method we will discuss is the Importance Sampling [14]. It is a fundamental step in Sequential Monte Carlo Methods [11,10]. The idea is extremely simple: one samples a certain amount of particles from the prior distribution, and weight is then calculated based on the ratio of the posterior and the prior evaluation, so the samples with adjusted weights reflect the posterior distribution. The algorithm is summarized in Algorithm 1: It is expected that the newly updated distribution is consistent with the target distribution: N n=1 ω n δ u n ∼ ρ pos in the sense that for any smooth test function f : 2. Normalize weight: However, the variance of the weights could be quite large, especially when ρ pos and ρ prior concentrate at different regions. According to the formulation of the method, this quantity can be explicitly computed: Thus, if ρ pos is non-trivial is the region where ρ prior almost vanishes, the quantity can be extremely big, leading to poor performance of the algorithm. Various re-sampling strategies have been proposed [1] to reduce the high variance.

Ensemble Kalman filters.
At the other end of the spectrum of sampling method is to not adjust weights at all. Every particle takes equal weight 1 N . Two typical examples are Ensemble Kalman Inversion and Ensemble Square Root Filter.
The link between sampling and the Kalman filter problem was drawn in an inspiring paper [21]. Kalman filter (or its more practical version: ensemble Kalman filter) is a class of data assimilation methods that combine data (usually collected at discrete time) with some underlying guessed system dynamics an estimation of parameters in dynamical systems. The dynamics is ran till discrete time when data is collected, and Bayes' rule is applied to update the distribution of the unknown parameters. The paper views the application of the Bayes' rule as an action at a delta function in time, and by inserting a mollifier, the updating process becomes continuous in time.
Such idea was elaborated to treat the steady state Bayes' sampling problem in [15], in which an artificial time is added. The method views the prior and the target distribution to be two functions on a function space, and designs a PDE that transforms one to another (either in finite time or in the infinite time horizon). The sampling strategy is in some sense equivalent to the particle method for the PDE: the samples are drawn from the initial distribution, and follow the flow of the PDE by satisfying the associated coupled-ODE/SDE systems. The initial finite-time sampling method is termed "Ensemble Kalman Inversion (EnKI)" in [15], and some variations were developed that achieve the final distribution in infinite time, termed "Ensemble Kalman Sampling (EKS)" [9]. Since there are no adjustment of weights, the variance of weights keep being 0 throughout the dynamics. Indeed, upon the well-posedness results of the SDE obtained in [23,3], in [8] the authors proved, using the mean-field argument [19,5], that when the forward map G is linear, the method provides approximately i.i.d. samples for the posterior distribution (with N −1/2 error in L 2 -Wasserstein metric).
However, both the derivation of the PDE, and the mean-field limit argument, highly rely on Gaussianity. The forward map is required to be linear for the arguments to carry through. This is not a surprising property since the method was originally derived from Ensemble Kalman Filter and thus inherits its strong requirement: the "motion" of the particles only depend on the first two moments, and thus the method automatically fails when higher moments are necessary, as in the non-Gaussian case.
We describe both EnKI and EnSRF in details below.

2.2.1.
Ensemble Square Root filter. The PDE for the ensemble square root filter (EnSRF) writes as the following: where G(t), Cov up (t) are expectation of G and the covariance of (u, G(u)) in (u, t): For this particular PDE, one can show that if G is linear, namely: the solution to the PDE (9) is the target posterior distribution at t = 1: (u, 1) = ρ pos .
Noting that the PDE (9) is essentially an advection-type PDE, it is easy to formulate the ODE system satisfied by the particles by simply following the trajectory: with {u n }, i.i.d. sampled from ρ prior at t = 0. Since the particles {u n } follow exactly the same flow as the PDE, it is straightforward to have, for ∀t: This approximation sign holds true in both weak sense, and in Wasserstein distance sense, for all t ≤ 1: -Weak convergence, for all f (u) bounded continuous : Note that the rate of convergence in L 2 -Wasserstein depends on the dimension. It is of O(N −1/2 ) if the dimension of u is smaller than 4. Details can be found in [13]. However, in the numerical experiment, since one does not have (u, t), G and Cov up (t) are not available. In implementation these terms are replaced by the ensemble covariance and the ensemble mean: and Cov These replacements naturally bring error to realizations of (11). To prove such error is small, the classical mean-field argument is ran. The full recipe of the algorithm is summarized in Algorithm 2.
It is clear in the algorithm, (14) is simply the forward Euler solver applied on ODE (11) with time step being h = 1/M , and the accuracy would be the standard O(h). The method was proposed in papers [17,21,24] as a data assimilation method. The idea behind the scene is rather simple. Suppose a large 1. Define empirical means and covariance: 2. Update (set m → m + 1) end number of particles are sampled from a normal distribution N (µ 1 , Σ 1 ), and to form N (µ 2 , Σ 2 ), one merely needs to adjust u n to a new location: The newly formulated particles are then i.i.d. drawn from N (µ 2 , Σ 2 ). The ODE (11) is the continuous in time version of this motion. It is immediate that since only the information of the first two moments is used, Gaussianity is crucial, meaning for consistency, the forward map G is necessary to be linear.

Ensemble Kalman Inversion.
A similar approach is used to derive another sampling method called Ensemble Kalman Inversion [21,12]. The corresponding PDE is the following: where Cov up (t), and Cov pu (t) are covariance of (u, G) and (G, u) in (u, t). H u is the Hessian of . In [8] the authors showed that the solution to the PDE reconstructs the posterior distribution in finite time: if the forward map G is linear (10). So the PDE provides a smooth path to transform the prior distribution to the target in the linear setting. On the particle level, by following the trajectory of this PDE one has the following SDEs: where dW n t is the Brownian motion. In the implementation of this SDE, since is not available, the covariance matrices need to be replaced by the ensemble versions, as is done in (12). Also in [8], the authors used the mean-field argument to show that, in the weakly nonlinear case The discrete version of the coupled SDE (18) formulates Algorithm 3. It is apparent that (21) is simply the Euler-Maruyama method for (18), as rigorously justified in [16,4].
The method was initially proposed in [15], as a further development of [21], to find optimized parameter for inverse problem. Then continuous limit for the discretization in time was considered in [23]. The wellposedness of the resulting SDE system was shown in [3,4], and in [8] the authors showed the mean-field limit, namely, the convergence of the SDE system to the PDE in the weakly nonlinear case is proved, and that the PDE provides the target distribution only in the linear setting is also shown. Defending on the perspective, this is in fact a negative result for the weakly nonlinear case: the target distribution is not the

Algorithm 3 Ensemble Kalman Inversion
Preparation: 1. Input: N 1; h 1 (time step); M = 1/h (stopping index); Γ; G (forward map) and y (data). 2. Initial: {u n 0 } N n=1 sampled from initial distribution ρ prior . Run: Set time step m = 0; While m < M : 1. Define empirical means and covariance: 2. Artificially perturb data (with ξ n 3. Update (set m → m + 1) solution to the PDE, but the method nevertheless presents the flow to the PDE, so the method does not give a consistent sampling of the target distribution. We also note that often in time, people view EnKI as an optimization algorithm instead of a sampling algorithm, and some relaxation terms have been added for convergence to the minimizer [6, 7].

Summary.
It is rather clear that in IS, the particles are kept in the original location, and one merely adjusts the weights. This guarantees the consistency, namely, (4) always holds true for all bounded continuous functions. On the other hand, since the particles do not move, the weights could be largely suppressed or enlarged, leading to large variance of the weights even in the Gaussian case. On the contrary, the later two algorithms, EnSRF and EnKI, move particles around to adjust the change of center and the variance. Since all particles are equally weighted, the variance is kept at 0. However, the derivation of both methods assumes the Gaussianity, and thus the consistency fails for the nonlinear forward map.

Weighted Ensemble Kalman Inversion and Square root filter
Our proposed algorithms combine the advantages of IS and EnKI/EnSRF, by including both weight and particle dynamics simultaneously, so that we guarantee the consistency at the expense of fairly small variance. The output of the algorithms would be an ensemble distribution having the format of as an approximation to the target distribution ρ pos . We call the proposed algorithms weighted-EnKI (WEnKI) and weighted-EnSRF (WEnSRF). As the names suggest, we largely keep the format of the flow (or the PDE) for EnKI and EnSRF, while we also add weights to achieve consistency. The underlying flow is designed so that the PDE solution provides a linear interpolation on the log-scale in a time parameter t, from the prior to the posterior distributions [15,23]: where Z(t) is a function in time to normalize ρ(u, t) so that It is clear that ρ(u, 0) = ρ prior , ρ(u, 1) = ρ pos , so the definition (23) provides a flow from the prior to the target posterior distribution. The prior distribution in our algorithm can be quite flexible, for example, for a C 2 function V (u), with Z being the normalization factor. In practice, however, the prior distribution needs to be an distribution that is easy to sample, so for now we assume: The strategy we follow is divided into two steps: Step 1: adjust the PDE (9) and (16) by adding weights so that (23) is a strong solution; Step 2: design a corresponding particle system that carries out the flow of the PDE.
Before diving into details of the algorithms, we first introduce some notations. A straightforward but somewhat tedious calculation yields, for ρ(u, t) defined in (23): where H u denotes the Hessian with respect to u, and V ∈ R L×1 , W ∈ R L×L are defined as 3.1. Weighted ensemble square root filter (WEnSRF). Calculating the left hand side of (9) using the identities (25)-(27), we arrive at the PDE that ρ, defined in (23), satisfies: where with shorthand notations , According to the derivation, it is a natural expectation that ρ is a strong solution. We will further show that the ensemble distribution of particles generated by the sampling method gives a weak solution to the PDE.
The PDE (30) can be solved using standard method of characteristics, which gives arise to the following coupled ODE system for the particles, with u n t denoting the location of the n-th particle at time t, and w n t the associated weight: The initial condition is chosen so that {u n 0 } N n=1 is i.i.d. sampled from ρ prior (u) du and w n 0 = 1/N , n = 1, . . . , N , to represent initial data (u, 0) = ρ prior . The system is decoupled, and {u n , w n } pairs are independent from each other. The output of the algorithm is the empirical distribution: In practice, (t) is unknown and thus Cov (t) and G (t) in (33) have to be replaced by the ensemble versions: and This replacement makes the SDE system tangled up. We summarize the method in Algorithm 4. Note that due to numerical error, it is typically hard to keep the summation of the weight 1, and numerically one performs normalization at each time step. Some properties of the method such as the consistency and the boundedness of the variance will be shown in Section 4.

Algorithm 4 Weighted Ensemble Square Root Filter
Preparation: 1. Define empirical means and covariance: 2. Update parameters: 3. Update (set m → m + 1): for all 1 ≤ n ≤ N : 3.2. Weighted Ensemble Kalman Inversion (WEnKI). The same strategy can be applied to modify EnKI to cope with nonlinearity. Substituting (25)-(27) into (16), we have where L is a linear operator inherited from (16): and the remaining terms R 1 , R 2 , R 3 are given by where Cov pp , Cov up , and Cov pu are the corresponding covariance matrices, as defined in (32). Similar to WEnSRF, we arrive at the following decoupled SDE system: where the Brownian motion is introduced for the second order term in L. The initial condition is chosen so that {u n 0 } N n=1 is i.i.d. sampled from ρ prior (u) and w n 0 = 1/N , n = 1, . . . , N . Since is unknown, as in (35)-(36), we once again replace the true covariance by the ensemble version, and define empirical distribution accordingly: There are two sources of randomness involved in WEnKI: the initial sampling and the Brownian motion in (40). Let Ω be the sample space and F 0 be the σ-algebra: σ (u n (t = 0), 1 ≤ n ≤ N ), then the filtration is introduced by the dynamics: . It can be shown the SDE is well-posed in this σ-algebra [8,3]. In next section, we will prove that the empirical distribution is consistent with ρ(u, t) defined in (23) under the expectation sense in F t . We will also give control to the variance. The method is summarized in Algorithm 5. As in the previous algorithm, the numerical error induces n w n = 1, and an extra renormalization is conducted.
Remark 3.1. It is important to note that the method is different from running EnKI to time t = 1 and then apply Important Sampling. The latter was proposed in [20] as a weighted version of Ensemble Kalman Filter [12], known as the Weighted Ensemble Kalman Filter (WEnKF).
Define the conditional mean and the conditional covariance: . In WEnKF, the particle weight is updated according to: where u n 0 are the initial samples according to the prior distribution, and N (·; E(u n | u n 0 ), Var(u n | u n 0 )) is the Gaussian distribution centered at the conditional mean. The major difference, compared with the one we propose in (40), is that the covariance used in (42) is calculated completely from the initial data. The updates along the evolution is entirely ignored. The updating formula in (40), however, involves the weights that evolve in time and is closer to the PDE solution (30).
2. Define first and second derivative:

Properties of WEnKI and WEnSRF
We establish a few important properties of WEnKI and WEnSRF in this section. As argued in Section 2, the two guiding principles for the algorithm-design is consistency and small variance of the weights. These two properties are presented in §4.1 and §4.2 respectively. Furthermore, we study the difference between WEnKI and EnKI in §4.3, and provide some intuition for EnKI performing well sometimes, even when G is nonlinear.

4.1.
Consistency. The most important property is the consistency, namely, on average, the ensemble mean tested on any smooth function is the same as the real mean. Since the PDEs are obtained by forcing (23) to be the solution, the consistency is expected.
We first present the theorem for WEnSRF.
Theorem 4.1. Assume G : R L → R K is a C 1 function, then: • the formula (23) is a strong solution to (30) with the initial condition (u, 0) = ρ prior , namely, the PDE (30) smoothly connects the prior and the posterior distributions; • the formula (33)-(34) is a weak solution to (30) with the initial condition (u, 0) = M u0 (u), namely, the ODE system (33) follows the flow of the transition: for any smooth test function f : R L → R and 0 ≤ t ≤ 1, we have consistency: where the E on the outer layer of the right hand side comes from the random configuration of the initial condition for {u n 0 }. Proof. The first point is trivial: it amounts to substituting the solution (23) into the equation and balancing terms. To show that the empirical measure M u n is the weak solution to the PDE, we test it with a smooth function f (u). Note that This is exactly the weak formulation of (30) tested on f with the integration by parts applied on the advection term. To show (44), we simply note that both ρ and = M ut are weak solutions.
The same type of theorem holds true for WEnKI: where the outer-layer E on the right hand side is taken in the probability space (Ω, F t , P).
Proof. The first part is again trivial. To show (45), we first realize that the equation holds trivially for t = 0 since {u n 0 } N n=1 are i.i.d sampled from ρ prior (u). For all t > 0, we plug in (40) and apply the Itô's formula on This is exactly the weak formulation of (37) tested on f with integration by parts moving the ∇ and H onto f . The equality (45) follows as ρ(u, t) defined in (23) is also a weak solution.

4.2.
Bounding the variance of weights. We investigate the behavior of the weights for a fairly large class of G. Throughout this subsection, we will impose one of the two assumptions on the forward map G below.
The first assumption is rather weak, and it only requires the boundedness of derivatives of G up to second order.
Assumption 4.1. G : R L → R K is C 2 function and there exists Λ > 0 such that The second assumption is slightly stronger, and it asks for the structure of the range of the linear and nonlinear components of G.

Assumption 4.2. G is weakly-nonlinear in the sense that there exists a matrix
where m(u) is a C 2 bounded Lipschitz function from R L to R K satisfying and there exists constants Λ, Λ 1 and M such that: If the second assumption holds true, we call the optimal solution for the linear part: and the associated residue r = y − Au * A .
(51) We also define the Gaussian part of the distribution so that we have This ρ A has expectation and the covariance matrix: It is immediate that the second assumption is stronger than the first one, and thus one would expect a tighter bound. Indeed, by comparing Theorem 4.3 and Theorem 4.4, we see that the variance of weights is bounded by a constant that exponentially grows with respect to |y|, the data, when only Assumption 4.1 holds true, but is bounded by a constant independent of |y| when Assumption 4.2 also holds.
Since EnKI is a more popular method than EnSRF, the analysis is conducted on WEnKI mainly. Similar analysis could potentially be applied to deal with WEnSRF but could be more delicate. We do not pursue it in this paper.
We also note that the analysis is conducted on the dynamics (40) with coefficients calculated from the exact density, and thus each particle is evolved independently (this is in the spirit of the McKean-Vlasov dynamics or propagation of chaos, expected in the mean field limit). The analysis for the numerical version, with all the covariance matrices replaced by the ensemble ones as seen in (35)-(36) will be left for future works.

4.2.1.
Bounded nonlinearity under Assumption 4.1. We first prove a lemma to bound covariance matrix of (u, t). Proof. By Theorem 4.2, = ρ(u, t) defined in (23) is a strong solution to the PDE. Taking partial derivative with respect to t and rewriting (25), we get Multiplying |y − G| 2 Γ on both sides and taking integral yields d dt which concludes the lemma. and |y| only such that for 0 ≤ t ≤ 1: In the proof below, we use C to denote a generic constant that changes from line to line, and we keep track of the constant's dependence on different argument. However, we do not specify the form of the dependence.
Proof. Consider (u, 0) = ρ prior = exp −|u − u 0 | 2 Γ0 , we expand G around 0 and utilize the bound (46) for: Therefore, where the last line defines constants C 1 and C 2 , which only depend on Λ, |u 0 |, Γ −1 Multiplying |G(u)| 2 and |u| 2 on both sides of (54), we get d dt where we use Lemma 4.1 in the second inequalities. By Grönwall inequality, we have: Choose C to be the bigger value of the two with t = 1, we conclude the lemma.
The immediate consequence of Lemma 4.1 and Lemma 4.2 is the boundedness of the covariance matrices: and |y| such that for 0 ≤ t ≤ 1 where Cov (t) uu , Cov (t) up , Cov (t) pp are the corresponding covariance matrices, as defined in (32). These a priori estimates are now used to bound the variance of the weights. 2 and |y| such that for any 0 ≤ t ≤ 1. Var(N ω n t ) ≤ C . Remark 4.1. We note that the result in Theorem 4.3 is not optimal. The constant, if traced carefully, blows up as |y| → ∞ with a rate of at least e |y| 2 , as suggested in (57). Essentially this result does not demonstrate WEnKI superior than the classical IS. However, as will be shown in Theorem 4.4, under a stronger assumption (Assumption 4.2), the dependence on y could be removed.
Multiplying ω n on both sides of the second equation of (40) and taking expectation, we have d dt where we have omitted the last three terms in R 2 because the sum of them is negative. We then bound the three terms in bracket separately. As a preparation, we note that Apply these inequalities to estimate R k defined in (39), we arrive at the following bounds.
where the last inequality comes from Corollary 4.1.
For the non-negative contribution from R 2 , we have where the last inequality comes from Lemma 4.2. Finally, for R 3 , we have where we have used Corollary 4.1, and that, by definition of W, All the constants above depend on Λ, |u 0 |, Γ −1 0 2 , Γ −1 2 and |y|. Substitute these into (60), we have dE|ω n t | 2 ≤ CE|ω n t | 2 . Realizing that ω n 0 = 1 N so that E|ω n 0 | 2 = 1 N 2 , we obtain E|ω n t | 2 ≤ e Ct N 2 . This concludes (59) and this theorem.

Weak nonlinearity under Assumption 4.2.
The variance bound can be improved when we assume further structure of the nonlinearity, namely, when the nonlinear component m(u) is perpendicular to the range of the linear component A, weighted by Γ −1/2 . In particular, the bound becomes independent of y, as shown in the following theorem.
where C 1 only depends on Λ, Γ −1 This theorem is a counterpart of Theorem 4.3, but stronger assumption on the nonlinearity is added. As a result, the variance of weight is bounded, independent of y. In the most extreme case, suppose G is entirely linear, Λ 1 = 0, then according to the theorem, the variance is bounded by a fixed constant. As a comparison, if one applies Important Sampling directly, for large y and thus large u * , the variance blows up at the order of O(e |u * | 2 ), equivalently to O(e |y| 2 ) for reasonably conditioned A. This means that under mild conditions (Assumption 4.2), the newly proposed WEnKI method significantly reduces the weight variance from the classical method IS.
The proof of the theorem is largely based on the following calculation.
The proof for the proposition is deferred to Appendix A. We now give the proof for Theorem 4.4 based on the above proposition.
Then (66) is a direct consequence, concluding the theorem.

4.3.
EnKI with nonlinear forward map. In this section, we study a slightly different topic: how different are WEnKI and EnKI? In fact, it was proved in [8] that EnKI is not a consistent sampling method when the forward map is nonlinear. The algorithm, without the weight, can be regarded as the discrete version of PDE (16), but the target distribution ρ(u, t) is not the solution to the PDE, and hence EnKI is inconsistent. It is numerically observed, however, that despite being inconsistent, EnKI mysteriously performs rather well [21], especially when the target distribution is almost Gaussian-like, no matter how nonlinear G is, also see the book [22] for more examples. To the best of our knowledge, such discrepancy in terms of theoretical and practical performance, has not been addressed in literature. In this subsection, as a first attempt to explain it, we provide one criterion, under which, EnKI performs similarly well as WEnKI.
The argument in the end comes down to comparing the continuous version of WEnKI and EnKI, two Fokker-Planck equations, with the former one having a weight term while the latter not.
Once again we denote ρ the target distribution, defined in (23) and proved to be the solution to equation (37) in Theorem 4.2, and let the solution to the Fokker-Planck equation without the weight: where Cov up (t), and Cov pu (t) are covariance of (u, G) and (G, u) in (u, t). It was proved in [8] that (68) is the mean-field limit of EnKI.
We will now show that ρ and are close when the weight term (defined in (39)) is small. This means that WEnKI and EnKI give more or less the same results when the weight term is small. We recall the bounded Lipschitz metric (d BL ) between probability measures: Since the admissible set in the supremum is smaller than the class of Lipschitz-1 function and 1-bounded function, this metric can be bounded by L 2 -Wasserstein distance W 2 (µ, ν) and total variation TV(µ, ν) We have the following theorem characterizing the difference between and ρ, i.e., EnKI and WEnKI (that is consistent to the target distribution).
This theorem states that the size of the weight gives control over the distance between ρ and . To compare them, we introduce an intermediate surrogate ρ, given by   where Cov ρ(t) up (t) and Cov ρ(t) pu (t) are given by ρ(u, t). We will bound d BL (ρ, ρ) and d BL ( ρ, ) in the following two propositions. The theorem is a direct consequence of the two. and |y|, such that for all 0 ≤ t ≤ 1. and |y|, such that for all 0 ≤ t ≤ 1.
Proof of Proposition 4.2. The proof is based on the following construction of particle system. Let with initial data u 0 sampled from µ prior and w 0 = 1. This is a Langevin dynamics, so that for any test function f :

It is clear from second equality in (74):
w t > 0 for all t, and that The boundedness (72) is a direct result by using L ∞ test function to bound total variation: Proof of Proposition 4.3. We first state and prove an estimate of the difference of covariance for all 0 ≤ t ≤ 1. For this, we first bound E|u t | 2 |w t − 1| using Itô's formula and (74): Taking expectation on both sides, we have where we use Corollary 4.1 and equation (58). By Grönwall's inequality and w 0 = 1, we get and (77) for any t ≤ 1. Combining (76) and (77), we have which proves (75).
We now come back to the Proposition to prove (73). We use two particle systems to represent (71) and (68). Let where the initial data u 0 is sampled from ρ prior (u), and let with the same initial data v 0 = u 0 . Then immediately To show the theorem, it suffices to prove for all t and then utilize (75).

Numerical results
In this section, we show some numerical evidence to demonstrate the superiority of the proposed method. All numerical examples are highly nonlinear, so we are away from the known "safe zone" where EnKI and EnSRF work both perfect. We remark that we conduct numerical experiments only in low dimensional setting to have a clear illustration of the behavior of the algorithm. 5.1. One dimension example. As a start, we first test out the 1D case. We set the normal distribution N (0, 1) as the prior distribution.
• Example 1: In this example we set G(u) = 4 cos(2(u − 3)) + sin(u − 3) and the data (with only one observation) is given at y = 0. The posterior distribution is a multimodal distributions, as shown in Figure 1. The number of samples is set to be N = 1000, and in WEnKI and WEnSRF, we choose the time step ∆t = 10 −3 . As a comparison, we plot the result using WEnKF (Remark 3.1) and Important Sampling, EnKI and EnSRF. In this example, the prior and the posterior distributions share supports, so IS and WEnKF, the two methods that achieve consistency, behave relatively well. But due to nonlinearity and non-Gaussianity, EnKI and EnSRF, the two methods that tend to give one-mode Gaussian-like profile, fail.
• Example 2: This is a highly nonlinear example with 4-th power in G: G(u) = (u − 3) 4 − 1, and data is still set to be y = 0. N = 2000 and ∆t = 10 −6 . WEnKI and WEnSRF clearly outperform the others, see Figure 2. Note that in the experiment we find that for stability of the Euler solver (33), and (40), the time step is chosen to be rather small.
• Example 3: In this example we set G(u) = (u − 5) 2 and data y = 0. N = 2000 and ∆t = 10 −3 . The posterior distribution has one peak, but is non-Gaussian. While the center of the prior is at 0, the center of the likelihood function is at u = 5: so there is a big shift of support from the prior to the posterior distribution. As seen in Figure 3, both WEnKI and WEnSRF still capture the posterior distribution rather well. EnKI and EnSRF cannot capture the entire profile, but at least can move to fit relatively accurate support. WEnKF and IS completely fail. To quantitatively study the behavior, we numerically compute the weights variance of all three methods (WEnKI, WEnSRF, and IS). (For IS, the weight at t is calculated using ρ(u, t) (defined in (23)).) In Figure 4, we plot evolution of the weight variance with respect to t in log scale (shifted by 1 for positivity). It shows the variance of weights in IS quickly blows up in time, while the quantity for the other two keep reasonably bounded. As a demonstration of consistency, we compare moments computed using the four methods, and the reference solution, as shown in Table 1. It is clear that despite EnKI and EnSRF are visually close to the groundtruth solution, the errors in the moments are still rather large. This is expected: the groundtruth solution is not a Gaussian distribution, but the underlying assumption for EnKI and EnSRF to be valid is the Gaussianity. 5.2. Two dimension example. We also present some 2-D examples. Normal distribution N (0, I 2 ) is chosen as the prior distribution.
• Example 4: We consider likelihood function exp(−Φ(u; y)) = 1 4 This design of likelihood function induces two separate centers, as shown in Figure 5. Here, we use N = 2000 and choose ∆t = 10 −4 for WEnKI and WEnSRF. They capture the motion of the particles accurately. In comparison, IS loses a lot of particles. In this multimodal example, EnKI and EnSRF fail as expected due to the Gaussian assumption.    • Example 5: In this case we consider G(u 1 , u 2 ) = (g 1 (u 1 , u 2 ), g 2 (u 1 , u 2 )) and y = (0, 0), where N = 1000 and ∆t = 10 −3 for WEnKI and WEnSRF. Results are presented in Figure 6. Due to the form of G, the center of the likelihood function is (2, 2) instead of (0, 0) for the prior distribution. Such transition of support is hard for IS to capture. After resampling, only a few samples survive. Visually EnKI and EnSRF still give satisfying results.
To quantitatively understand the performance of the algorithms, we compute the variance of weight (81) and accuracy of moments estimation in this example. The variance of the weight, as a function of time, is plotted in Figure 7 in log scale (shifted by 1 for positivity). As can be seen clearly, the weight of IS blows up quickly while the two newly proposed methods stay reasonable. In Table 2, we tabulate the error of higher moments. Even though EnKI and EnSRF are visually good methods, in comparison, they do not capture the moments as well as their weighted versions.

Conclusion
We conclude the paper with a few remarks of the proposed algorithms.
Since EnKI was proposed in [15], the mystery of if and how it works for the nonlinear case has attracted a lot of attention. The surrounding work, such as the wellposedness of the coupled SDE [4], the wellposedness of the PDE [8,9], the mean-field limit of SDE to the PDE, and the convergence rate [8,13], and convergence as an optimization method [6,7], have all been studied in depth, and the use of similar idea leads to development of new algorithms [18,9]. The investigation into the core sampling problem with nonlinear forward map, however, is thin.
In this paper, by adding the weights to the particles, we are able to correct EnKI (and similarly EnSRF) to ensure the consistency of the algorithm for nonlinear G. The derivation, though tedious, is mathematically straightforward. The resulting "weight" factor (39), however, is mathematically messy and physically not intuitive at all. We would like to emphasize that this nonphysical weight term is uniquely determined once the flow is set, namely, if one follows the flow of EnKI, du n = Cov up Γ −1 (y − G(u n t )) dt + Cov up Γ −1/2 dW t , so that in time, the flow provides a linear interpolation between the origin and the target on the logarithmic scale: ρ(u, t) ∼ µ prior exp{−t|y − G(u)| 2 Γ /2} , then there will be no other ways to define the weight term, and it has to be as tedious and nonphysical as was derived in this paper. This naturally leads to the question if some modifications to the flow can result a physically more meaningful weight function. This, however, is beyond the scope of the current paper.
Appendix A. Proof of Proposition 4.1 In this appendix, we derive the explicit bound for R i in the following lemma, and show Proof of Proposition 4.1. It plays the crucial role in Theorem 4.4.