Fem-based Discretization-invariant Mcmc Methods for Pde-constrained Bayesian Inverse Problems

We present a systematic construction of FEM-based dimension-independent (discretization-invariant) Markov chain Monte Carlo (MCMC) approaches to explore PDE-constrained Bayesian inverse problems in infinite dimensional parameter spaces. In particular, we consider two frameworks to achieve this goal: Metropolize-then-discretize and discretize-then-Metropolize. The former refers to the method of discretizing function-space MCMC methods. The latter, on the other hand, first discretizes the Bayesian inverse problem and then proposes MCMC methods for the resulting discretized posterior probability density. In general, these two frameworks do not commute , that is, the resulting finite dimensional MCMC algorithms are not identical. The discretization step of the former may not be trivial since it involves both numerical analysis and probability theory, while the latter , perhaps " easier " , may not be discretization-invariant using traditional approaches. This paper constructively develops finite element (FEM) dis-cretization schemes for both frameworks and shows that both commutativity and discretization-invariant are attained. In particular, it shows how to construct discretize-then-Metropolize approaches for both Metropolis-adjusted Langevin algorithm and hybrid Monte Carlo method that commute with their Metropolize-then-discretize counterparts. The key that enables this achievement is a proper FEM discretization of the prior, the likelihood, and the Bayes' formula, together with a correct definition of quantities such as gradient and covariance matrix in discretized finite dimensional parameter spaces. The implication is that practitioners can take advantage of the developments in this paper to straightforwardly construct discretization-invariant discretize-then-Metropolize MCMC for large-scale inverse problems. Numerical results for one-and two-dimensional elliptic inverse problems with up to 17899 parameters are presented to support the proposed developments.


Introduction
Solving large-scale ill-posed inverse problems that are governed by partial differential equations (PDEs) is, though tremendously challenging, of great practical importance in science and industry.Classical deterministic inverse methodologies, which provide point estimates of the solution, are not capable of rigorously accounting for the uncertainty in the inverse solution.The Bayesian formulation provides a systematic quantification of uncertainty by posing inverse problem as one of statistical inference.The Bayesian framework for inverse problems proceeds as follows: given observational data d ∈ R K and their uncertainty, the governing forward problem and its uncertainty, and a prior probability density function describing uncertainty in the parameters u ∈ R N , the solution of the inverse problems is the posterior probability distribution π (u|d) over the parameters.Bayes' Theorem explicitly gives the posterior density as π (u|d) ∝ π like (d|u) × π prior (u) which updates the prior knowledge π prior (u) using the likelihood π like (d|u).The prior encodes any knowledge or assumptions about the parameter space that we may wish to impose before any data are observed, while the likelihood explicitly represents the probability that a given set of parameters u might give rise to the observed data d.
Even when the prior and noise probability distributions are Gaussian, the posterior need not be Gaussian, due to the nonlinearity embedded in the likelihood.For large-scale inverse problems, exploring non-Gaussian posterior in high dimensions to compute statistics is a grand challenging problem since evaluating the posterior at each point in the parameter space requires a solution of the forward PDEs.Numerical quadrature to compute the mean and covariance matrix, for example, is generally infeasible in high dimensions.Usually, the method of choice for computing statistics is Markov chain Monte Carlo (MCMC), which judiciously samples the posterior distribution, so that sample statistics can be used to approximate the exact ones.
Perhaps the Metropolis-Hastings (MH) algorithm, first developed by Metropolis et al. [1] and then generalized by Hastings [2], is the most popular MCMC method.Its popularity and attractiveness come from the easiness in implementation and minimal requirements on the target density and the proposal density [3,4].One of the simplest instances of the MH algorithm is the random walk Metropolis-Hastings (RWMH) which can be considered as the Euler-Maruyama discretization of a stochastic ordinary differential equation with step size ∆t = σ 2 .The RWMH method is simple except for a small detail: how to choose the optimal step size σ 2 ?
Choosing the time step, also known as the proposal variance, σ 2 optimally is vital since it determines the mixing time which is approximately the number of steps to explore the stationary distribution.If σ 2 is too small, it is most likely that all the proposed moves are accepted, but the chain explores π (u|d) very slowly since the proposed jump is small.On the other hand, if the proposal variance is large, it is most likely that the proposed move is in low probability regions, and hence rejected.This case also leads to slow mixing since the chain virtually does not move at all.As a result, the proposal variance should be in between these extremes, and this is known as the Goldilocks principle [5].
It turns out that the proposal variance for RWMH must scale like σ 2 = 2 N −1 , with some constant , for the average acceptance rate to be bounded away from zero as the dimension N approaches infinity [6,7].Similarly, one can show that the optimal scaling for the Metropolis-adjusted Langevin algorithms (MALA) [8] is σ 2 = 2 N −1/3 , and σ 2 = 2 N −1/4 for hybrid Monte Carlo (HMC) methods [9].Clearly, all these methods suffer from vanishingly small proposal variance as the parameter dimension increases.As such they are unlikely to be useful for Bayesian posteriors governed by PDEs in which mesh refinement, and hence increasingly higher parameter dimension, is often needed to resolve important physical features of the problem under consideration.
Meanwhile, efforts in designing MCMC methods to explore probability measures in infinite dimensional spaces have been greatly successful, including the creation of random walk Metropolis-Hasting methods [10,11], Metropolis-adjusted Langevin algorithms [10,11], and hybrid Monte Carlo methods [12,13] in function spaces.We refer to these as function-space MCMC methods.The appealing feature of these methods is that, if discretized properly, their performance, particularly the acceptance rate, is independent of the parameter dimension, and hence the mesh size.Thus, they are perhaps one of the viable MCMC options for large-scale PDE-constrained Bayesian inverse problems in infinite dimensional parameter spaces.
To be amenable for computer implementations, function-space MCMC methods on infinite dimensional spaces need to be discretized.Finite difference method together with Karhunen-Loève truncation [10,11,12,13] has been proposed to discretize function-space MCMC methods.Care must be taken in designing discretization schemes so that they preserve/inherit the dimension-independent, also called discretization-invariant, property.While discretization of forward PDE using FEM is widespread, a systematic FEM discretization of function-space MCMCs is not popular, and this limits the potential impact of these scalable MCMC approaches on PDE-constrained inverse problems.On the other hand, a more straightforward approach is to first discretize the PDE and the parameter space, form a Bayesian posterior for discretized inverse problems and then propose a finite dimensional MCMC method.Naive discretization schemes may lead to MCMC methods whose performance deteriorates as the parameter dimension increases, e.g. when the mesh is refined, [10,11,12,13].How to constructively develop scalable MCMC methods for this "easier" route in the context of FEM methods has attracted little attention and this may prevent practitioners from constructing dimension-independent MCMC approaches for large-scale problems.
The main objective of this paper is to address the aforementioned issues.In particular, the goal is to construct FEM-based discretization-invariant MCMC methods for PDE-constrained Bayesian inverse problems in infinite dimensional parameter spaces.To that end, we first present an inverse problem governed by elliptic PDEs in Section 2 together with a well-defined infinite dimensional Bayesian setting with prior Gaussian measure.The task at hand is to explore the Bayesian inverse problem using MCMC techniques.To accomplish this, we consider two frameworks: Metropolize-then-discretize and discretize-then-Metropolize.The former refers to the approach of first proposing a function-space MCMC method for the Bayes' posterior measure in infinite dimensional spaces and then discretizing both of them.The latter, on the other hand, first discretizes the Bayesian inverse problem and then proposes MCMC methods for the resulting discretized posterior probability density in finite dimensional spaces.
In general, Metropolize-then-discretize and discretize-then-Metropolize do not commute, that is, the resulting finite dimensional algorithms are not identical.As shall be shown, the discretization step of the former may not be trivial since it involves both numerical analysis and probability theory, while the latter may not be dimension-independent using standard approaches.In this paper, we develop dimension-independent finite element discretizations for both frameworks.In particular, we show how to construct discretizethen-Metropolize approaches for both Metropolis-adjusted Langevin algorithm and hybrid Monte Carlo method that commute with their Metropolizethen-discretize counterparts, and hence are discretization-invariant.The key that facilitates our developments is a proper FEM discretization of the prior, the likelihood, and the Bayes' formula, together with a correct definition of quantities such as gradient and covariance matrix in discretized finite dimensional parameter spaces.We next summarize some important details of each section of the paper.
Section 3 presents a Metropolize-then-discretize (MTD) framework for both function-space Metropolis-adjusted Langevin algorithm (denoted as FMALA) and function-space hybrid Monte Carlo (denoted as FHMC) method.In particular, we first summarize FMALA in Section 3.1 and FHMC in Section 3.2.Since either of the methods requires the gradient of the negative logarithm of the likelihood, we present an adjoint approach to compute the gradient efficiently in Section 3.3.We next present in detail the discretization of the prior, the likelihood, the posterior, the FMALA, and the FHMC in Section 3.4.At the heart of our discretization scheme is the combination of the Galerkin finite element method, the matrix transfer technique, the Karhunen-Loève expansion, and some functional analysis.We then discuss at length the "easier" route, namely, the discretize-then-Metropolize approach in Section 4. Specifically, we present an R N -view discretization of the posterior measure in Section 4.1, finite element gradient in Section 4.2, finite dimensional MALA algorithms in Section 4.3, and finite dimensional HMC algorithms in Section 4.4.The key finding in Section 4 is that a proper construction of finite dimensional MCMC methods taking into account the mathematical structure of the problem under consideration are discretization-invariant while standard approaches may not.Various numerical results to validate our developments are presented in Section 5 and conclusions together with future work are discussed in Section 6.Finally, we close the paper with an appendix on the Hessian, its various definitions, and its computation for future work on Hessian-based discretization-invariant MCMC methods.

Problem Statement
In this paper, though our proposed framework is valid for Bayesian inverse problems governed by any partial differential equations (PDEs), we restrict ourselves to those with elliptic PDEs for simplicity.For concreteness, we consider heat conduction problem governed by the following elliptic PDE in an open and bounded domain Ω ⊂ R n , n = {1, 2}: where w is the forward state, u the logarithm of distributed thermal conductivity on Ω, n the unit outward normal on ∂Ω, and Bi the Biot number.
Here, Γ R is a portion of the boundary ∂Ω on which the inflow heat flux is 1.
The rest of the boundary is assumed to have Robin boundary condition.
In the forward problem, the task is to solve for the temperature distribution w given a description of distributed parameter u.In the inverse problem, the task is to reconstruct u given some available observations, e.g, temperature observed at some parts/locations of the domain Ω.We choose to cast the inverse problem into the framework of PDE-constrained optimization.To begin, let us consider the following additive noise-corrupted pointwise observation model where K is the total number of observation locations, {x j } K j=1 the set of points at which w is observed, η j the additive noise, and d j the actual noisecorrupted observations.In this paper we work with synthetic observations and hence there is no model inadequacy in (1).Concatenating all the observations, one can rewrite (1) as with F := [w (x 1 ) , . . ., w (x K )] T denoting the map (known as the forward or parameter-to-observable map) from the distributed parameter u to the noisefree observations, η being random numbers normally distributed by N (0, L) with bounded covariance matrix L, and d = [d 1 , . . ., d K ] T .For simplicity, we take L = σ 2 I, where I is the identity matrix.For notational convenience, we use boldface letters for vectors and matrices.
Our inverse problem can be now formulated as subject to the forward problem where denotes the weighted Euclidean norm induced by the canonical inner product •, • R K in R K .This optimization problem is however ill-posed.An intuitive reason is that the dimension of observations d is much smaller than that of the parameter u (typically infinite before discretization), and hence they provide limited information about the distributed parameter u.As a result, the null space of the Jacobian of the parameter-to-observable map F is non-empty.In particular, for a class of inverse problems, we have shown that the Gauss-Newton approximation of the Hessian (which is the square of the Jacobian, and is also equal to the full Hessian of the misfit J with noise-free data evaluated at the optimal parameter) is a compact operator [14,15,16], and hence its range space is effectively finite-dimensional.One way to overcome the ill-posedness is to use Tikhonov regularization (see, e.g., [17]), which proposes to augment the cost functional (3) with a quadratic term, i.e., where α is a regularization parameter, R some regularization operator, and • some appropriate norm.This method is a representative of deterministic inverse solution techniques that typically do not take into account the randomness due to measurements and other sources, though one can equip the deterministic solution with a confidence region by post-processing (see, e.g., [18] and references therein).It should be pointed out that if the regularization term is replaced by the Cameron-Martin norm of u (the second term in (10) below), the Tikhonov solution is in fact identical to the maximum a posteriori point in (10).However, such a point estimate is insufficient for the purpose of fully taking the randomness into account.
In this paper, we choose to tackle the ill-posedness using a Bayesian framework [19,20,21,22,23].We seek a statistical description of all possible u that conforms to some prior knowledge and at the same time is consistent with the observations.The Bayesian approach does this by reformulating the inverse problem as a problem in statistical inference, incorporating uncertainties in the observations, the forward map F, and prior information.This approach is appealing since it can incorporate most, if not all, kinds of randomness in a systematic manner.To begin, we postulate a Gaussian measure µ := N (u 0 , C) with mean function u 0 and covariance operator C on u in L 2 (Ω) where with the domain of definition of A defined as Here, H 2 (Ω) is the usual Sobolev space.Assume that the mean function u 0 lives in the Cameron-Martin space of C, then one can show (see, e.g., [22]) that the prior measure µ is well-defined when s > d/2 (d is the spatial dimension), and in that case, any realization from the prior distribution µ almost surely resides in the Hölder space X := C 0,β (Ω) with 0 < β < s/2.That is, µ (X) = 1, and the Bayesian posterior measure ν satisfies the Radon-Nikodym derivative if F is a continuous map from X to R K .Note that the Radon-Nikodym derivative is proportional to the the likelihood defined by The maximum a posteriori (MAP) point (see, e.g., [22,24] for the definition of the MAP point in infinite dimensional settings) is defined as where denotes the weighted L 2 (Ω) norm induced by the . We shall also use •, • L 2 (Ω) to denote the duality pairing on L 2 (Ω).Note that, by definition, the MAP point u M AP is a function in the Cameron-Martin space.
It should be pointed out that the last term in (10) can be considered as a regularization.Indeed, the MAP point can be considered as a solution to a deterministic inverse problem.However, the Bayesian approach provides a more complete picture of the solution.In particular, the posterior, i.e. the Bayesian solution, encodes all possible solutions with associated confidence (probability).The task at hand is to provide not only a solution but also its associated uncertainty.Perhaps the most general tool to accomplish this task is the Markov chain Monte Carlo (MCMC) framework [4] that we now discuss.

Metropolize-then-discretize framework
In this section we present a Metropolize-then-discretize framework in which Markov chain Monte Carlo (MCMC) methods are first introduced on function spaces for the Bayesian posterior (8) and then discretized using finite element method.We restrict ourselves to two classes of MCMC methods: the preconditioned Metropolis-adjusted Langevin algorithm, and the hybrid Monte Carlo (also known as Hamiltonian Monte Carlo) method.

Function-space Metropolis-adjusted Langevin algorithm (FMALA)
To explore the posterior given by the Radon-Nikodym derivative (8) with nonlinear misfit functional J (u) we can use the preconditioned Crank-Nicholson Langevin MCMC algorithm developed in [10,11].The key feature of function-space MCMC methods is that, if properly discretized, the acceptance rate of resulting discretized methods is dimension-independent [10], and hence practical for large-scale inverse problems in which the mesh needs to be refined or adapted.Finite dimensional MCMC methods such as those in [7,8] whose proposal variances (and hence acceptance rate) depend on dimension of a discretization under consideration is unlikely to be useful for our target large-scale applications.Indeed, existing works [7,8] show that the proposal variance must vanish as the parameter dimension approaches infinity for the acceptance rate to be bounded away from zero.The detailed derivation and analysis of the function-space Langevin MCMC method can be consulted in [10].Let us summarize some important points that are needed in this paper.
To define a Metropolis-Hastings MCMC algorithm for the posterior measure ν on L 2 (Ω), following [25], let us start by denoting q (u, dv) as the transition measure on L 2 (Ω) given u.Furthermore, let be a joint measure on the product space L 2 (Ω)×L 2 (Ω).The transpose measure is then defined as η T (du, dv) := η (dv, du).The standard Metropolis-Hastings MCMC [4] can be extended to function space [25] with the following acceptance probability where dη T dη (u, v) is the Radon-Nikodym derivative of η T with respect to η.What remains to be defined is the transition measure q (u, dv) and the explicit form of the acceptance probability (12).Care must be taken in designing q (u, dv) so that η is not singular with respect to η T since on infinite dimensional spaces even Gaussian probability measures are prone to be mutually singular due to the Feldman-Hajek theorem [26].Following [27] we consider the following stochastic PDE with ∇J (u) being the Fréchet derivative of J (u), and b being the nuclear C−Wiener process [26] on L 2 (Ω).Note that the "product" C∇J (u) makes sense since C is an operator from the dual space [L 2 (Ω)] (the space of linear continuous functionals on L 2 (Ω)) to L 2 (Ω), and We conventionally write ∇J (u; z) to denote the action of ∇J (u) on any z ∈ L 2 (Ω).One can show (see, e.g., [27,26]) that ν is the invariant measure of (13).Now discretizing (13) with a preconditioned simplified Crank-Nicholson method [10] with time step ∆t we obtain the proposal v given u as where ξ is distributed by the prior Gaussian measure µ.We define the proposal measure q (u, dv) by ( 14).Then the work in [11] shows that where ρ (u, v) is defined as .

Function-space hybrid Monte Carlo method (FHMC)
In this section, we recall the hybrid Monte Carlo, also known as Hamiltonian Monte Carlo, method.In particular, we follow the Hilbert space setting presented in [12].To begin, let us denote by ∆t the time step, L the number of time steps, t the artificial time, and ϑ the artificial velocity.Consider the following Hamiltonian and the following discrete dynamic which corresponds to integrating the splitting Hamiltonian dynamics [12] du dt = 0, and Since the discrete dynamics (18) only approximates the continuous counterpart (19), the Hamiltonian ( 17) is not preserved, though volume-preserving and reversibility are guaranteed.Thus, one needs to equip the HMC approach with an acceptance/rejection rule to ensure the convergence of the Markov chain.Following [12], we define the acceptance probability as where , with u i , ϑ i as the intermediate steps of the integration scheme (18) defined in Algorithm 1.Note that this algorithm is used to generate one state/sample of the HMC chain, that is, it needs to be executed N s times if N s samples are required.Here, u (0) is chosen as the last sample or as the MAP point for the first sample computation.
Note that the Hamiltonian difference in (21) is not in the standard form.The reason is that the term u, C −1 u L 2 (Ω) (and ϑ, C −1 ϑ L 2 (Ω) ) in the formal definition of the Hamiltoninan ( 17) is almost surely infinite.It is only finite for u in the Cameron-Martin space of C, whose measure is zero under the prior Gaussian measure.Expression (21) takes into account the cancellation of almost surely infinite terms on the continuous level before discretization.As shall be demonstrated by numerical results in Section 5, this is critical to the success of HMC methods in high dimensional parameter spaces.

Adjoint computation of the derivative
As can be seen in the previous sections, both FMALA and FHMC methods require the derivative ∇J (u) of the cost functional (3).In this section, we present an adjoint technique to compute derivative efficiently.We start by considering the weak form of the forward equation (4): with λ as the test function.Using the standard reduced space approach (see, e.g., a general discussion in [28] and a detailed derivation in [29]) one can show that the Fréchet derivative (assumed to exist) ∇J (u; •) of the cost functional J (u) acting in any direction ũ is given by where the adjoint state λ satisfies the adjoint equation with ŵ as the test function.The procedure for computing the gradient acting on an arbitrary direction ũ is now clear.One first solves the forward equation ( 22) for w, then the adjoint (24) for λ, and finally evaluating (23).

Discretization using finite element method
Clearly, for computer implementation, we need to discretize the prior, the likelihood, and hence the posterior.This section presents our contribution to the Metropolize-then-discretize approach, namely the discretization using the finite element method.In particular, we employ the standard H1 (Ω) finite element method (FEM) to discretize the forward and adjoint (the likelihood), and the operator A (the prior) 1 .For convenience, we further assume that the discretized state and parameter reside on the same finite element mesh.Since FEM approximation of elliptic operators is standard (see, e.g., [30]), we will not discuss it here.

Finite element approximation of the prior
In this section, we describe a combination of FEM and the matrix transfer technique (see, e.g.[31] and the references therein, for the detail) to discretize truncated Karhunen-Loève (KL) expansions.Note that this is neither the unique nor the best way to discretize a Gaussian measure.In fact, it is chosen simply for convenience and the readers are referred to, for example, [32,33] for a different approach.Let us start by defining Q := A −s/2 , then the eigenpairs (λ i , v i ) of Q define the KL expansion of the prior distribution as where a i ∼ N (0, 1).The truncated KL expansion for u with N modes, again denoted by u for convenience, reads Clearly, we need to solve the following eigenvalue problem to determine (λ i , v i ): We choose to solve (27) using the matrix transfer technique (MTT).To that end, let us denote by M the mass matrix, and K the stiffness matrix resulting from the discretization of the Laplacian ∆ using the FEM method.The representation of A (see, e.g., [34]) in the finite element space is given by A := M −1 K + I.
If we define (σ i , v i ) as eigenpairs for A, i.e, where Σ is the diagonal matrix with entries σ i , then it is easy to see that with δ ij as the Kronecker delta function.Note that v i is in fact the FEM nodal vector of v i .Since A is similar to , a symmetric positive definite matrix, A has positive eigenvalues.Using MTT method, the matrix representation of (27) reads where It follows that We use the Lagrangian FEM to discretize the prior so that the FEM approximation of the parameter is given by the FEM ansatz where u i are the nodal values, ϕ i the nodal Lagrange basis functions, and N the total number of FEM basis functions.Note that the number of FEM basis functions does not need to be the same as the number of truncated modes in (26), but they are chosen to be the same for sake of convenience.This particular choice implies that more modes (higher parameter dimension) implies finer mesh and vice versa.From now on to the end of the paper, let boldface symbol denote the corresponding vector of FEM nodal values, e.g., u is the vector containing all FEM nodal values, u i , of u.Again, for ease in writing, we have used the same notation u for both infinite dimensional quantity and its FEM approximation. the FEM approximation of each Substituting the FEM ansatz for both u and v j into the truncated KL expansion (26), and using the Galerkin projection we obtain after eliminating the mass matrix M. Here, Λ is an N × N diagonal matrix with entries λ i and a ∼ N (0, I) is a random vector whose components are a i .Since u ∈ L 2 (Ω), u belongs to R N M , i.e. the Euclidean space with weighted inner product Let us emphasize that the mass matrix M, i.e. the discretized Riesz matrix in L 2 (Ω), is the natural weight coming from the L 2 (Ω)-setting of the prior.
A question arises: what is the distribution of the FEM nodal vector u?One can see from (32) that u is a Gaussian since a i are.What remains to be done is to determine the mean and covariance matrix of u.It is easy to see that the mean is u 0 by taking expectation of both sides of (32) with respect to a and using the fact that E a [a i ] = 0. We find the covariance matrix c using definition of random vectors in R N M , i.e., z, cy where the expectation E u and E a are taken with respect to u and a, respectively.Note that we have used (32) to obtain the second equality and the fact that E a aa T = I to reach the third equality.It follows that where we have used (28) to obtain the second equality.A simple algebra manipulation shows that c is a self-adjoint operator from R N M to R N M .The square root of the covariance matrix is thus given by c which is also a self-adjoint operator from R N M to R N M .Again, using the identity (28) one can see that c whence the distribution of u, i.e. the FEM discretization of the prior, is given by that is, the discretized prior is a Gaussian distribution in R N M with mean u 0 and covariance matrix c.

Finite element approximation of the likelihood
The forward and adjoint equations ( 22) and ( 24) are second order elliptic PDEs.As such, they are (perhaps) best approximated with the standard FEM method, so that the forward solution, for example, can be approximated as The FEM approximation for each equation results in a system of algebraic equations whose unknowns are the nodal values, e.g.w for the forward equation.This standard process can be found in [30].Once w is solved for, the misfit , and hence the likelihood (9), is readily available.
We next focus on the discretization of the gradient.We begin with some conventions and definitions.Since u ∈ L 2 (Ω), we follow [35] to define the gradient G (u) ∈ L 2 (Ω) of the cost functional (3) as the Riesz representation of the gradient (23) with respect to the L 2 (Ω) inner product, i.e., We can use FEM to approximate the gradient as with g i as the nodal values of G.Note that g = g (u) = g (u) but for simplicity we will ignore this dependency except where this could be ambiguous.Similar to u, g belongs to R N M .It follows that when u and ũ are restricted to the finite element space we have We have defined the gradient G in L 2 (Ω) (on which the Gaussian prior is postulated).We have also defined its counterpart, i.e. g, in R N M , and their mutual relationships are dictated by the FEM ansatz (39).We shall derive the relationships between the gradient in R N M and the FEM gradient defined in Section 4.2.

Finite element approximation of the posterior
Since the infinite dimensional Bayes formula (8) becomes the standard one [22] when restricted to finite dimensional space, the FEM approximation of the posterior, particularly the posterior density, is therefore given by (41) Since the posterior density is defined on finite dimensional space R N M , it is amenable to computer implementation of MCMC methods.However, care must be taken in designing MCMC methods so that their performance does not deteriorate as the dimension of the discretization increases (e.g. when the mesh is refined).This is desirable, especially for Bayesian inverse problems governed by PDE in which mesh refinement is often needed to resolve important physical features of the problem under consideration.One of the main objectives of this paper is to constructively derive such FEM-based discretization-invariant MCMC methods.
Though the convergence analysis of our finite element approximation scheme is important, it is a subject of future work.On the other hand, it should be mentioned that a convergence analysis for finite dimensional posterior can be found in [22] using spectral method and in [36] using a combination of spectral method and boundary element approach.A different FEM approximation of both prior and likelihood, and hence posterior, can be found in [32] for a particular case of s = 2 in a three dimensional spatial seismic inverse problem.

M
The goal of this section is to derive a FEM-based discretization-invariant MALA algorithm.We shall achieve this objective by using the FEM discretization proposed in Sections 3.4.1,3.4.2,and 3.4.3 to discretize the FMALA discussed in Section 3.1.The task at hand is to carry out a finite element approximation for (14), (15), and (16).The question that needs to be addressed is how to cast ( 14), (15), and ( 16) in terms of quantities in R N M .The non-trivial term is C∇J (u), namely, the action of the prior covariance operator on the gradient.We begin with the definition of the covariance operator acting on z, ∇J (u) ∈ L 2 (Ω): where we have used (38), and the expectation is taken with respect to the random function r distributed by the prior measure µ.Now, using the FEM method to discretize z as in (31), G (u) as in (40), and r as in (32) we obtain where we have used E a aa T = I.Clearly, if z = ϕ i , i.e. the ith FEM nodal basis function, then z is the vector with z i = 1 and z j = 0, ∀j = i.The result in (42) implies the discretization of the square root On the other hand, since v ∈ L 2 (Ω) the proposal equation ( 14) is equivalent to for all φ ∈ L 2 (Ω).Now, discretizing the random function ξ as in (32), restricting v and u in the FEM space as in (31), representing ∇J (u) as in (38) and (40), taking φ as FEM nodal basis functions, and using ( 42) we arrive at a FEM discretization of ( 14): after eliminating the Riesz matrix M. Here, ξ is a random vector given by (32), and hence distributed by the discretized prior (36).Next, using (34) we conclude In order to compute the acceptance probability (12), we need to discretize (16).This can be done by first substituting the FEM ansatz (31) for v, u, second using definition (38) and FEM approximation (40) for ∇J (u; v ± u), and third employing (43) for C 1 2 .Doing so yields Here, by ρ (u, v) and J (u), we mean the substitutions of FEM ansatz (37) for u and v into ρ (u, v) and J (u).Using ( 28) and ( 35) yield The acceptance rate ( 15) can be therefore approximated as Similar to the function space setting in Section 3.1, if we ignore terms involving g in ( 45) and (46), we obtain a FEM approximation of the preconditioned Crank-Nicholson random walk.
We have shown how to constructively discretize the FMALA method using a Galerkin FEM method.In particular, formulas ( 14)-( 16) for FMALA on L 2 (Ω) have been cast into (45)- (47) for discretized FMALA on R N M .Recall that FMALA is a MALA method defined on the Hilbert space L 2 (Ω).A closer comparison between ( 45)-( 47) and ( 14)-( 16) reveals an important result, namely, the discretized MALA is nothing more than an application of the FMALA algorithm to the Hilbert space R N M .To see this, we note that the results ( 14)-( 16) are valid for any Hilbert space.Thus, they are certainly applicable for R N M once all the corresponding quantities such as gradient g, covariance matrix c (and its square root), etc, in R N M are identified, and this has been one of the main focuses done in Sections 3.4.1 and 3.4.2.In particular, replacing v, u, C, ∇J (u) and ξ with v, u, c, g and ξ in ( 14) immediately gives (45).Similarly, using u, v, g and c 1 2 in the place of u, v, ∇J (u) and C 1 2 in ( 16) and replacing the L 2 (Ω)−inner product with the M−inner product we obtain exactly (46).Finally, replacing u, v with u, v we recover (47).
We have constructed a FEM-based MALA method on R N M .As can be seen, it is derived by either discretizing the FMALA on L 2 (Ω) or applying the FMALA to R N M .We refer to this method as R N M -MALA, for which the proposal v given the current state u of the Markov chain is computed using (45) and the acceptance rate is computed via (47).Let us now provide a brief discussion on the discretization-invariant property of R N M -FMALA, and a detailed rigorous analysis is left for future work.The convergence of the distribution of the sequence of random functions with FEM nodal vector ξ to the prior distribution µ is clear by the convergence of FEM [30], of the matrix transfer technique [34], and of the Karhunen-Loève expansion.The convergence of c as a representation of C in R N M is due to the same reasons.The convergence of the function sequences with FEM nodal vectors g to the gradient G is due to the interpolation theory on finite element spaces [30].The convergence of the function sequences with FEM nodal vectors u to u is due to the same reason.Consequently, the function sequences with FEM nodal vectors v in (45) converges to the proposal function v in (14).The convergence of ρ (u, v) to ρ (u, v), and hence the convergence of dη T dη (u, v) to dη T dη (u, v), is due to similar reasons.As shall be shown in the numerical results in Section 5, this is the key ensuring the inheritance of dimension-independent, and thus mesh-independent, property of the R N M -MALA method.

M
In this section, we construct a FEM-based discretization-invariant HMC method in R N M .Similar to Section 3.4.4,we accomplish this goal using a FEM approximation of the FHMC defined on L 2 (Ω) that is described in Section 3.2.The presentation is necessary brief since we use similar Galerkin projection and MTT technique as discussed at length in Section 3.4.4.In particular, the Galerkin FEM discretization of the dynamics (18) where, let us recall, except for c, bold symbols denote vectors of FEM nodal values.
Similarly, a FEM approximation of the acceptance probability (20) can be written as where Here, we have defined g i := g (u i ) , i = 1, . . ., L, and u i is the FEM nodal vector of u i defined in Algorithm 1. Similar to Section 3.4.4we refer to the HMC method defined by ( 48)-(50) and Algorithm 1 with u, ϑ replaced by u, ϑ as R N M -HMC.Note that, due to the Galerkin FEM approximation, we draw ϑ (0) using the truncated KL expansion in (32).As a result, there are two equivalent view points on R N M -HMC: it can be considered as an algorithm resulting from a systematic discretization of the FHMC method on L 2 (Ω) using the FEM method, or it is an algorithm obtained by applying the FHMC method to the weighted Euclidean space R N M .Either of the views requires proper definitions/constructions of the prior, the gradient g, and the covariance matrix c (and its square root) on R N M to which Sections 3.4.1 and 3.4.2are devoted.Numerical results in Section 5 show that R N M -HMC is discretization-invariant and the reasons are similar to those discussed at the end of Section 3.4.4.

Discretize-then-Metropolize approach
We have presented FEM approximations of the prior, the likelihood, and the posterior in Section 3.4.As pointed out, the nodal vector of the parameter, i.e. u, should be considered as a vector in the weighted Euclidean space R N M .This is inherited from the L 2 (Ω)-setting of the prior.In practice, the R N -view of the nodal vector seems to be more natural and "easier" for practitioners.The challenge here is how to construct a discretization-invariant approximation procedure in R N and whether it is equivalent to the R N M -view.The goal of this section is to give detailed answers for these questions.

R N -discretization of the infinite dimensional Bayes' formula
We begin with the formal density of the posterior with respect to the Lebesgue measure as and we wish to discretize the formal density π (u|d).To that end, we need to discretize the likelihood π like (d|u) using the FEM method, and this was already done in Section 3.4.2,since FEM approximation of the likelihood is independent of either of the view points.What remains to be done is to discretize the formal density of the prior, and it is sufficient to consider where we have defined a function z ∈ L 2 (Ω) as Note that (52) is similar to (27) and for that reason we also use the MTT approach as in Section 3.4.1 to discretize (52).Doing so casts (52) into where u, u 0 and z are the FEM nodal vectors of u, u 0 and z, respectively.Consequently, we have where we have used the FEM approximation (29) for A s/2 and ( 28) for simplifying the results 2 .As can be seen, ( 53) coincides with the negative logarithm of the discrete prior probability density of u in (36) using the R N M -view.Recall that we derived (36) using statistics/probability theory while we have used the MTT technique, a deterministic approach, to arrive at (53), and yet they are identical.
We conclude that the FEM approximation of the formal posterior density (51) is exactly the same as (41), i.e., (54) In other words, we obtain identical FEM discretization of the infinite dimensional Bayes' formula using either of the R N M -or the R N -view.It is clear from (54) that the prior distribution in R N is a Gaussian N (u 0 , C) and our next task is to determine C. From (54), we infer from which we can deduce its inverse as where we have used (28) to obtain the last equality.Comparing ( 34) and ( 55) we obtain that is, the covariances in R N and R N M are related but different.It follows that u ∼ N (u 0 , C) in R N .
The next question that we like to address is how to generate random vectors from the prior N (u 0 , C).By inspection we see that, if a ∼ N (0, I) then a random vector v ∈ R N defined by is distributed by the prior N (u 0 , C).As can be seen, ( 57) is exactly the same as (32).The difference here is the interpretation, namely, random vectors generated by (57) belong to R N while those from ( 32) are understood as vectors in R N M .

Gradient in R N
In the following, we will present discretization-invariant MCMC methods in R N to explore the posterior (54).To that end, we need to define the gradient in R N .We begin with the definition of the FEM gradient vector G ∈ R N : i.e. the derivative of the cost functional along direction spanned by the FEM basis ϕ i , which is precisely the gradient of the cost functional J (u) with respect to u i , namely, Again, J and G are a function of u, and we omit their argument for simplicity, but we recover the dependency occasionally if there is a chance of misunderstanding.We will use this convention throughout the paper for other quantities as well.
We have the following simple result relating the gradients in R N and R N M : which can be proved by taking ũ in (38) as ϕ i , for i = 1, . . ., N , and using ( 39) and (58).
Remark 1.Note that the FEM gradient G is typically the output of a FEM computer code.It can be considered as the gradient of the cost functional with respect to u if we view u as a vector in the standard Euclidean space R N .Identity (59) allows us to express the gradient in R N M as a function of its counterpart in R N .

Construction of a discretization-invariant MALA method in R N
At this point, we have defined the posterior (54), the prior covariance C, and the gradient G in R N .These provide sufficient ingredients to define MALA methods on R N .In particular, we are going to present two R N -MALA approaches to explore the posterior.Clearly, one can employ the original MALA algorithm developed in [8], but for the sake of fairness compared to the R N M -MALA let us use the same Crank-Nicholson discretization strategy proposed in [10,11] for the posterior (54).Doing so yields where ζ is distributed by N (u 0 , C).In other words, ζ is distributed exactly by (32), but now understood as a random vector in R N .Thus, the proposal density of v given u reads Alternatively, these results can be considered as an application of the function space MALA approach, i.e.FMALA, to the posterior (54) on the Hilbert space R N .What remains to be defined is the acceptance rate.At this point, there are two obvious approaches.In the first approach, we continue applying the FMALA approach presented in Section 3.1 to R N .In particular, the acceptance rate is computed using expressions ( 12), (15), and ( 16).Nevertheless, similar to Section 3.4.4,ρ (u, v) needs to be computed using appropriate quantities in R N , i.e. using the FEM gradient G and the covariance matrix C. Doing so gives We next show that the R N -MALA method consisting of (60), (62), and (15) is identical to the R N M -MALA approach in Section 3.4.4.Indeed, using identities (59) and (56) we can rewrite (45) which is equivalent to (60) since both ξ and ζ are random vectors distributed by (32).On the other hand, if we use identities ( 59) and ( 56), it is easy to see that ( 46) is identical to (62).Again, the difference here is the interpretation, namely, nodal vectors in the R N M -MALA method belong to R N M , but those from the R N -MALA counterpart reside in R N .
In the second R N -MALA approach, instead of using ( 62) and ( 15) to compute the acceptance probability we use the standard Metropolization: As can be observed, compared to the first R N -MALA approach (or equivalently the R N M -MALA approach in Section 3.4.4), the second R N -MALA approach is the same up to the proposal (60).However, unlike the former, the acceptance rate of the latter in (63) is purely from the R N -view and does not take into account the fact that the finite dimensional posterior (54) is a discretization of an infinite dimensional problem.In particular, while almost surely infinite terms are eliminated in the former approach, they still present in the latter one.Indeed, without loss of generality let us assume u 0 = 0, one can show that Note that terms involving v T C −1 v and u T C −1 u are very large, in fact approach infinity as the parameter dimension increases, (due to the presence of ζ in (60)) if the finite element space does not conform in the Cameron-Martin space.Nevertheless, the last two lines in (64) vanish in exact arithmetic.
While the first R N -MALA approach takes this into account, the second does not.Consequently, the last two lines are not zero (due to round-off errors) and may dominate ρ (u, v) − ρ (v, u).As will be shown in the numerical results, it is these terms that make the acceptance rate of the second approach deteriorate as the dimension of u increases (e.g. when the mesh is refined).

Construction of a discretization-invariant HMC method in R N
In Section 3.4.5 we have presented a FEM discretization of the FHMC method on L 2 (Ω).In this section, we will first discretize the Hamiltonian, and then present two R N -HMC algorithms parallel to the R N -MALA methods in Section 4.3.To begin, using the MTT technique as in (53) and the prior covariance (55) we can discretize the Hamiltonian (17) to obtain In the first R N -HMC algorithm, we apply the FHMC discrete dynamic (18) to the Hamiltonian (65) in the Hilbert space R N .Substituting appropriate quantities, e.g.G and C, of R N in (18) gives The acceptance rate (20), together with (21), when applied to the discrete Hamiltonian (65 where Here, we have defined G i := G (u i ) , i = 1, . . ., L, and u i is the FEM nodal vector of u i defined in Algorithm 1.Now using identities (59) and (56) it is easy to see that (67)-( 68) are identical to ( 49)-(50).That is, the first R N -HMC algorithm is equivalent to the R N M -HMC method.Let us now construct the second R N -HMC approach by using the standard finite dimensional HMC approach for the Hamiltonian (65).For the sake of comparison, instead of considering the standard momentum-based HMC approach [37,38], we use the same velocity-based HMC framework proposed in [12].That is, the same numerical integrator (66) is employed, but the acceptance rate is based on the usual formula As can be observed, the second R N -HMC approach is almost identical to the first one.Indeed the difference is only at the acceptance rate computation.The former suffers from large errors due to approximating terms with large value (in fact infinite as the number of mesh points, and hence parameter dimension, increases), while this does not happen for the latter because large-value terms are already eliminated on the continuous level.As shall be shown in the numerical results, the approximation errors turn out to be detrimental.In particular, they lead to vanishing small acceptance rates for the second R N -HMC approach.
Let us summarize one of the most important results in this paper about the commutativity of the Metropolize-then-discretize and the discretize-then-Metropolize approaches.Here, by commutativity we mean that both approaches yield the same discretization-invariant finite dimensional MCMC methods.
Theorem 1. Suppose proper definitions of the gradients g, G and covariance matrices c, C are provided on R N M and R N , respectively.If function-space MCMC methods are applied to the Hilbert space R N , the discretize-then-Metropolize approaches commute with the Metropolize-then-discretize coun-terparts.In particular, R N -MALA and R N -HMC algorithms are identical to R N M -MALA and R N M -HMC counterparts.Clearly the result also holds true for other function-space MCMC techniques [10,11] though we limit ourselves to the MALA and HMC approaches.The key to the commutativity is to use the correct and well-defined quantities, e.g.gradient, covariance, and inner product, in the setting under consideration.Defining these properly and systematically in the finite element context is one of the main contributions of this paper.More importantly, the commutativity allows one to constructively develop discretization-invariant finite dimensional MCMC methods, such as those in this paper, whose performance is independent of the parameter dimension, and hence the mesh refinement.
Remark 2. It should be pointed out that the commutativity discussed in this paper is not the same as that of discretize-then-optimize and optimize-thendiscretize approaches in the PDE-constrained optimization literature (see, e.g., [39]).The FEM method typically yields the latter for elliptic forward PDEs.The former is completely different, namely, it is solely based first on proper definitions of g, G, c and C, and then on the application of functionspace MCMC methods on the Hilbert spaces R N M and R N .The former is thus guaranteed, irrespective of the discretization methods and PDE types (though we use the FEM method and the forward PDE happens to be elliptic).Indeed, the discretization of the forward PDE has no role in our commutativity results.On the other hand, even when the FEM method is used, our notion of commutativity may not hold and this is what happens for the second R N -MALA and the second R N -HMC approaches.

Numerical Results
In this section we present numerical results to justify our developments in this paper.Recall that we have presented one Metropolize-then-discretize (MTD) method for MALA and one for HMC in Sections 3.4.4and 3.4.5,respectively, but two discretize-then-Metropolize (DTM) ones for either of the approaches in Sections 4.3 and 4.4.Since the first DTM method is the same as the MTD R N M -one, we call either of them as MTD method, and we refer to the second R N -approach as the DTM method.
We will present examples in one and two dimensional spatial spaces.In the first example, we consider the case with Ω = [0, 1] and Γ R = {1}.We take K = 65 synthetic observations at x j = (j − 1)/2 6 , j = 1, . . ., K, and the noise is assumed to be i.i.d. with variance taken as σ = 0.01 max j {w (x j )}.Here, w (x j ) are synthetic observations with the ground truth parameter field u (x) = 0.1 cos (2πx).The inversion task is to recover this ground truth given the noisy data d modeled as in ( 1)- (2).For the numerical results, we take Bi = 0.1 in (4), α = 8, and s = 0.9 in (6).For all MCMC runs, we take the MAP point as the initial state.For all results, we by no means attempt to run MCMC chains until convergence.Instead, we simply take, say 5000, as a representative sample size for numerical illustrations.
We first compare the MTD MALA, the DTM MALA, and the standard MALA [8].For this comparison, we consider three different mesh sizes h = {2 −9 , 2 −8 , 2 −7 }, or equivalently N = {129, 257, 513}.For time step ∆t, we choose the following values {0.1, 0.05, 0.02, 0.01, 0.005, 0.002} for both MTD MALA and DTM MALA approaches, while it is {0.128, 0.064, 0.032, 0.016, 0.08}× 10 −3 for the standard MALA method.For all MCMC runs, after discarding the first 100 samples as burn-ins we take the chain with 5000 samples from which the average acceptance rate is computed.The results for MTD MALA, DTM MALA, and standard MALA are shown in Figures 1(a), 1(b), and 1(c) respectively.As expected, the accep-tance rate of the MTD MALA is independent of the mesh size, and hence the parameter dimension.However, the DTM MALA approach is better than expected.Indeed, it is identical to the MTD MALA counterpart.The reason is that the Cameron-Martin space is H s = H 0.9 which is a superset of the H 1 finite element space.In other words, the finite element space conforms in the Cameron-Martin one, and hence all the terms involving C −1 in (64) are well-defined and their sum is negligible compared to ρ (u, v) − ρ (v, u).This implies that the first two methods are identical.In contrast, the standard MALA approach suffers from mesh refinement, i.e., the acceptance rate decreases.This is consistent with existing results in [10,11,7,8] (in which different discretizations are used).This highlights the need for discretizationinvariant MCMC methods for PDE-constrained Bayesian inverse problems in infinite dimensional parameter spaces.Clearly, for all methods, the acceptance rate, as expected, decreases as the time step ∆t, i.e. the proposal variance, increases.
We now consider HMC methods for the same inverse problem but present the result differently.In particular, we choose a single time step of ∆t = 0.05 with L = 50 time steps, and a wide range of mesh size h = {2 −13 , 2 −12 , 2 −11 , 2 −10 , 2 −9 , 2 −8 , 2 −7 }.That is, the smallest parameter dimension is 129 and the largest is 8193.For all MCMC runs, after discarding the first 100 samples as burn-ins we take the chain with 5000 samples from which the average acceptance rate is computed.In Figure 2 are the average acceptance rates for the MTD HMC, DTM HMC, standard HMC [37,38], and a prior-conditioned standard HMC methods.Note that the last method uses the inverse of the prior covariance matrix as the mass matrix in the kinetic energy term (see, e.g., [38] for the definition of the mass matrix).As can be seen, the acceptance rates for the first two and the last HMC methods are mesh-independent, while it is not for the standard HMC method.This is expected since standard HMC is purely based on the R N -view, and hence its performance deteriorate as the parameter dimension increases.
Note that the cost of the four methods are comparable since the action of the prior is negligible in this example.The results therefore seem to suggest that the first two and the last methods are the method of choice.Among these three methods, the dimension-independent behavior of the MTD HMC is expected and is consistent with [12].However, the DTM HMC approach is better than expected.Indeed, it is identical to the MTD HMC one.The reason is that the Cameron-Martin space is H s = H 0.9 which is a superset of the H 1 finite element space.In other words, the finite element space conforms in the Cameron-Martin one, and hence all the terms in the formal Hamiltonian (17) are well-defined and finite.This implies that the standard Hamiltonian difference used in (69) is (numerically) the same as the improved one in (21).Thus, the first (MTD HMC) and second (DTM HMC) R N -HMC methods behave the same in this case.The prior-conditioned standard HMC is surprisingly good.Average acceptance rate for various HMC approaches as a function of the mesh size (parameter dimension): Metropolize-then-discretize HMC (Hilbertian HMC with discretized infinite dimensional acceptance rate), discretize-then-Metropolize HMC (Hilbertian HMC with the standard finite dimensional acceptance rate), prior-preconditioned standard HMC, and the standard HMC with identity mass matrix.
To see whether the above results for MTD and DTM approaches are carried over to higher dimensional spatial space, we consider an example in two dimensions.In this case, the forward equation ( 4) is still the same, but the domain Ω together with its dimensions is given in Figure 3(a).Red circles are the measurement/observation points.We will consider three mesh sizes {h, h/2, h/4} and the coarsest mesh with size h is shown in Figure 3(a).The coarsest mesh has 1333 parameters, medium 4760, and fine 17899.Here, Γ R = [−0.5,0.5]×{0}, s = 1.4,Bi = 0.1, α = 5, and σ = 0.01 max j {w (x j )}.The exact synthetic parameter field is a Gaussian blob given by Let us first compare MTD MALA and DTM MALA.The time step is chosen to be ∆t = 0.00015 for both approaches.In Table 1 are the average acceptance rates over 5000 samples for different mesh sizes.As can be seen, the acceptance rate of the former is mesh-independent while that of the latter is not.The reason is the following.For the Gaussian measure to be welldefined we have chosen s = 1.4 > d/2 = 1.Thus, the H 1 finite element space does not conform in the Cameron-Martin space H s = H 1.4 , and terms such as 64) are very large.Round-off errors from these terms dominate ρ (u, v) − ρ (v, u), and this destroys the Metropolization step.To demonstrate this problem let us define for the MTD MALA approach and for the DTM MALA counterpart.The difference between these two are terms involving C −1 in (64).Again, the difference is zero in exact arithmetic.Figure 4 shows the plot of κ for the coarsest mesh case in Figure 3(a).As can be seen, κ of the DTM MALA method is more than three orders of magnitude larger than that of the MTD MALA approach.In other words, the round-off errors from large terms involving C −1 entirely overwhelm ρ (u, v)− ρ (v, u).Consequently, the acceptance rate in the third column of Table 1 is completely erratic.This is the reason why the DTM MALA method fails.We next compare the average acceptance rate of MTD HMC, DTM HMC, and the prior-conditioned standard HMC approaches as the mesh is refined.To that end, we take ∆t = 0.015, L = 100, and generate 1000 samples for each method and each mesh after discarding the first 100 samples as burnins.The results for three methods are shown in Table 2.As can be seen, the acceptance rate of the prior-preconditioned standard HMC is better than that of the DTM HMC, though it degrades as the mesh is refined.The DTM HMC seems to be mesh-dependent, but the acceptance rate is vanishingly small.The MTD HMC is the best in the sense that it not only is independent of the mesh, but also has very high acceptance rate.
To gain further insights into each method, let us present the sample mean with 5000 samples.The result is shown in Figure 5 on the same color scale for all three methods on the coarsest mesh.As can be observed, the sample means of MTD HMC and prior-preconditioned standard HMC approaches in Figures 5(a) and 5(c), respectively, are similar, and they both reveal the position of the Gaussian blob at (0, 2) (compared to the synthetic ground truth parameter field in Figure 3(b)).The DTM HMC method, however, does not seem to succeed in exploring the posterior, and its sample mean in Figure 5(b) is completely random.
Let us now explain why the DTM HMC approach fails in this case.For  Table 2: Average acceptance rate of the Metropolize-then-discretize (MTD) HMC, the discretize-then-Metropolize (DTM) HMC, and the prior-conditioned standard HMC as the mesh is refined.the Gaussian measure to be well-defined we have chosen s = 1.4 > d/2 = 1.Thus, the H 1 finite element space does not conform in the Cameron-Martin space H s = H 1.4 , and term such as u, C −1 u L 2 (Ω) in the formal Hamiltonian (17) is almost surely infinite.Finite element approximation of these terms incurs large errors (which are supposed to be zero) that dominate the actual Hamiltonian difference, and this ruins the Metropolization step.

MTD HMC DTM HMC prior-preconditioned
To demonstrate this problem, let us plot, for the coarsest mesh case in Figure 5, the Hamiltonian difference using the standard and improved approaches in Figure 6.As can be seen, the standard Hamiltonian difference is more than three orders of magnitude larger than the improved one (the actual difference), and the decision on whether to accept a proposal is completely wrong.
We conclude that standard MCMC methods are not discretization-invariant.On the other hand, straightforward approaches such as the DTM methods considered in this section are corrupted by round-off/approximation errors due to theoretically infinite terms.The MTD approaches take this into account and provide methods that are independent of parameter dimensions.

Conclusions and future work
We have present two FEM-based discretization-invariant MCMC methods for PDE-constrained Bayesian inverse problems in infinite dimensional parameter spaces.These result from a systematic FEM discretization of the prior, the likelihood, the posterior, and function-space MALA and HMC algorithms.We have shown that they can be also considered as the result of applying the function-space MALA and HMC methods to the Hilbert spaces R N M and R N .The key that enables our achievements is the proper definition of quantities such as gradient and covariance matrix in these finite dimensional spaces, and we have provided step-by-step derivations using the standard finite element method.Here, by discretization-invariant MCMC methods we mean those whose behavior such as acceptance rate does not deteriorate as the parameter dimension increases.This is of paramount important since mesh refinement, and hence increasingly higher parameter dimension, is often needed to resolve important features of physical phenomena modeled by partial differential equations.
We have considered two frameworks: Metropolize-then-discretize and discretizethen-Metropolize.The former refers to the method of first proposing a function-space MCMC method for PDE-constrained Bayesian inverse problem in infinite dimensional parameter space and then discretizing both of them.The latter, on the other hand, first discretizes the Bayesian inverse problem and then proposes MCMC methods for the resulting finite dimensional discretized Bayesian posterior.In general, these two frameworks do not commute, that is, the resulting finite dimensional algorithms are not identical.The discretization step of the former may not be trivial, while the latter, as has been shown, may not be dimension-independent using traditional or straightforward approaches.In this paper, we develop finite element discretization schemes for both frameworks so that both commutativity and discretization-invariant are attained.In particular, we show how to construct discretize-then-Metropolize approaches for both Metropolisadjusted Langevin algorithm and hybrid Monte Carlo method that commute with their Metropolize-then-discretize counterparts.The implication is that practitioners can take advantage of the developments in this paper to straightforwardly construct discretization-invariant discretize-then-Metropolize MCMC for large-scale inverse problems.Numerical results for one-and two-dimensional elliptic inverse problems with up to 17899 parameters have been presented to support our developments.
Ongoing research is to develop mesh-independent Riemann manifold Hamiltonian MCMC methods.This class of methods has been shown to be every effective [40,41] with very high acceptance rate and with almost uncorrelated samples.The work on Hessian in the Appendix will be useful for these type of methods in which the Hessian is the key player.We are also investigating rigorous convergence analysis for the proposed discretization-invariant MCMC methods using ideas from [22,36].
From (73) and (39), we deduce the following relationship between H and h when restricted to finite element space: Similarly, we define the FEM Hessian H as that is, the second derivative of the cost functional along directions spanned by the FEM basis functions ϕ i and ϕ j , and we deduce

1 2
= c.Now, the inverse of c reads c

Figure 1 :
Figure 1: Average acceptance rate for Metropolize-then-Discretize MALA, Discretize-then-Metropolize MALA, and standard MA LA as the mesh is refined.

Figure 2 :
Figure2: Average acceptance rate for various HMC approaches as a function of the mesh size (parameter dimension): Metropolize-then-discretize HMC (Hilbertian HMC with discretized infinite dimensional acceptance rate), discretize-then-Metropolize HMC (Hilbertian HMC with the standard finite dimensional acceptance rate), prior-preconditioned standard HMC, and the standard HMC with identity mass matrix.

Figure 3 :
Figure 3: Left: finite element coarsest mesh of the thermal fin and its dimensions.Red circles are the measurement/observation points.Right: the exact synthetic parameter field u

Figure 4 :
Figure 4: A plot of κ for both MTD MALA and DTM MALA approaches on the coarsest mesh.

Remark 3 .
Note that the FEM gradient G and Hessian H are typically the outputs of a FEM computer code.They can be considered as the gradient and Hessian of the cost functional with respect to u if u is viewed as a vector in the standard Euclidean space R N .Expressions (59) and (75) allow us to express the gradient and Hessian in R N M as a function of their counterparts in R N .

Figure 6 :
Figure 6: Standard versus improved Hamiltonian difference for two dimensional spatial problem.

Table 1 :
Average acceptance rate of the Metropolize-then-discretize MALA and the discretize-then-Metropolize MALA as the mesh is refined.