Hyperpriors for Matérn fields with applications in Bayesian inversion

We introduce non-stationary Matérn field priors with stochastic partial differential equations, and construct correlation length-scaling with hyperpriors. We model both the hyperprior and the Matérn prior as continuous-parameter random fields. As hypermodels, we use Cauchy and Gaussian random fields, which we map suitably to a desired correlation length-scaling range. For computations, we discretise the models with finite difference methods. We consider the convergence of the discretised prior and posterior in the discretisation limit. We apply the developed methodology to certain interpolation and numerical differentiation problems, and show numerically that we can perform Bayesian inversion which promotes the competing constraints of smoothness and edge-preservation. For computing the conditional mean estimate from the posterior distribution, we use a combination of Gibbs and Metropolis-within-Gibbs sampling algorithms.


Introduction
In many Bayesian statistical estimation algorithms, the objective is to explore the posterior distribution of a continuous-parameter unknown v(x), x ∈ R^d, d = 1, 2, ..., given a direct or indirect noisy realisation y ∈ R^M of the unknown v at fixed locations. Bayesian estimation algorithms are widely used for example in spatial statistics, machine learning and Bayesian statistical inverse problems, and specific applications include e.g. remote sensing, medical imaging and ground prospecting [1,4,13,16,21,25].
A priori information is a key factor in any Bayesian statistical estimation algorithm, as it is used to stabilise the posterior distribution. Gaussian processes and fields are common choices as priors because of their analytical and computational properties, and because of their easy construction through the mean and covariance functions.
Gaussian priors are known to promote smooth estimates [13]. However, the smoothness of the unknown is inherently problem-specific. For example, in many atmospheric remote sensing algorithms, the unknown is often assumed to be continuous [18]. On the other hand, in many subsurface imaging or medical tomography applications, the unknown may have anisotropies, inhomogeneities, and even discontinuities [8].
A common method for expanding prior models outside the scope of Gaussian distributions is to apply hyperparametric models [5,6]. In this paper, we study hypermodels for inhomogeneous Matérn fields, and our specific objective is to demonstrate that the proposed priors are flexible enough to promote both smooth and edge-preserving estimates.
Matérn fields, a class of Gaussian Markov random fields [16,21,24], are often defined as stationary Gaussian random fields, i.e. with them we can model isotropic unknowns. By a simple change of variables, these fields can also model anisotropic features. By modelling the Matérn fields with stochastic partial differential equations and by locally defining the correlation parameters, we can even construct inhomogeneous random fields [16,22]. In addition, we can make them computationally very efficient, as the finite-dimensional approximation of the inverse covariance matrix, the precision matrix, is sparse by construction.
Let us denote by v_N the finite-dimensional approximation of the continuous-parameter random field v. The solution of a Bayesian estimation problem is the so-called posterior distribution. We give it as an unnormalised probability density

D(v_N | y) = D(y)^{-1} D(y | v_N) D(v_N),   (1)

where the likelihood density D(y | v_N) is obtained e.g. through some physical observation system, and the a priori density D(v_N) reflects our information on the unknown object before any actual measurement is made. We take v_N to be an approximation of a Matérn field. D(y) is a normalisation constant, which we can often omit from the analysis. We model the Matérn field length-scaling as a continuous-parameter random field ℓ(x), i.e. we construct a prior for certain parameters of the prior D(v_N). Denoting the discrete approximation of the continuous-parameter field ℓ by ℓ_N, we may include the hyperprior in the posterior distribution (1), and hence write the posterior probability density as

D(v_N, ℓ_N | y) ∝ D(y | v_N) D(v_N | ℓ_N) D(ℓ_N),

where D(v_N, ℓ_N) = D(v_N | ℓ_N) D(ℓ_N) is the prior constructed from the hyperprior D(ℓ_N), and the prior itself is D(v_N | ℓ_N). The idea is similar to the length-scale modelling in [19,20], but we use stochastic partial differential equations and sparse matrices, hence gaining computational advantages.
To simplify the sampling of ℓ, we introduce an auxiliary random field u, and apply a further parametrisation ℓ = ℓ(x; u). We choose to model u as a continuous-parameter Cauchy or Gaussian random field, but it could also be something else, for example an α-stable random field [17,26]. In this paper, we put an emphasis on the discretisation of these hypermodels and the convergence of the discrete models to the continuous models. From a computational point of view, we need to discuss Markov chain Monte Carlo (MCMC) methods, and in particular we consider Gibbs and Metropolis-within-Gibbs sampling.
Modelling non-stationary Gaussian random fields and using them in e.g. interpolation is not a new idea; see for example Fuglstad et al. 2015 [9] for a comprehensive reference list. Other constructions of non-stationary Gaussian processes that have lately received notable attention include e.g. so-called deep learning algorithms, where an n-layer Gaussian process is formed in such a way that the deepest level is a stationary Gaussian process. These methods produce interesting priors from an engineering perspective, but they typically lack rigorous mathematical analysis [19,20]. Also, many deep learning methods rely on full covariance matrices, hence the computational burden may become a bottleneck, especially in high-dimensional problems.
We note that discontinuities are best modelled with specific non-Gaussian prior constructions. A common choice is a total variation prior, which promotes edge-preserving estimates. However, total variation priors are well known to behave as Gaussian smoothness priors when the discretisation is made denser and denser (Lassas and Siltanen 2004) [15]. Constructions of non-Gaussian priors which do not converge to Gaussian fields in the discretisation limit include the Besov space priors (Lassas et al. 2009) [14], which are constructed on a wavelet basis, hierarchical Mumford-Shah priors (Helin and Lassas 2011) [11], and recently applied Cauchy priors (Markkanen et al. 2016) [17]. Construction of a Matérn-style random field with non-Gaussian noise has been studied by Bolin 2014 [3]. These algorithms may work well for example for edge-preserving Bayesian inversion, but they may lack the smoothness or limiting properties which we aim to deploy in the proposed hypermodel construction.
The rest of this paper is organised as follows: In Section 2, we review the basics of linear Bayesian statistical estimation algorithms. In Section 3, we discuss continuous Matérn fields and construct certain hypermodels. In Section 4, we consider the discretisation of the hypermodels and the convergence of the discretised models to the continuous models. We consider rough hypermodels in Section 5, and discuss Gibbs and Metropolis-within-Gibbs algorithms in Section 6. In Section 7, we show with a number of numerical examples how to use the constructed models in interpolation algorithms.

Linear Bayesian estimation problems
Let us consider a continuous-parameter linear statistical estimation problem

y = Av + e.   (2)

We assume that we know exactly one realisation of y and the linear mapping A from some function space (e.g. a separable Banach space) to a finite-dimensional space R^M. We further assume that we know the statistical properties of the noise e. From now on, we assume that e is zero-mean Gaussian white noise, statistically independent of v. We emphasise that we do not know the realisation of the noise e. Given these assumptions, our objective is to estimate the posterior distribution of v. For computational Bayesian statistical estimation problems, we discretise Equation (2) and write it as a matrix equation

y = Av_N + e.   (3)

The likelihood density, a factor of the posterior distribution (see Equation (1)), can then be given as

D(y | v_N) ∝ exp(−(1/2)(y − Av_N)^T Σ^{−1}(y − Av_N)),

where the Gaussian measurement noise e ∼ N(0, Σ), and Σ is the noise covariance matrix.
We now start the preparations for the numerical sampling of the posterior with the hyperprior. First, we consider the case where the unknown v_N, conditioned on the hyperparameter ℓ_N, has a Gaussian distribution N(0, C).
We can write the prior as a normalised probability density

D(v_N | ℓ_N) = (2π)^{−N/2} |C(ℓ_N)|^{−1/2} exp(−(1/2) v_N^T C(ℓ_N)^{−1} v_N).

Here we have included the normalisation constant, as we aim to include the hyperparameters of v_N in the matrix C, i.e. |C| = |C(ℓ_N)| is not a constant over the different values of the hyperparameter, unlike the normalisation constant |Σ| in the likelihood density. Our aim is to decompose the prior inverse covariance, i.e. the precision matrix, as C(ℓ_N)^{−1} = (L(ℓ_N))^T L(ℓ_N). This means that, similarly to Equation (2), we can present the prior also as an observation equation L(ℓ_N)v_N = −w_N, where w_N ∼ N(0, I). Hence, we can form a stacked matrix equation. The benefit of this formulation is that we can easily apply Gibbs sampling to the posterior distribution, and hence compute the posterior mean of v_N. We can write a similar estimate for the variable length-scaling ℓ_N. For the estimation of ℓ_N, we use the so-called Metropolis-within-Gibbs algorithm [17], where we make Gibbs-type sampling, but component-wise we apply the Metropolis-Hastings algorithm. Alternatives for Metropolis-within-Gibbs include e.g. the pseudo-marginal approach to MCMC studied by Filippone and Girolami 2014 [10]. The estimation of v_N can be done with standard Gibbs sampling techniques.

Matérn field priors and hyperpriors
Matérn fields are often defined as stationary Gaussian random fields with a covariance function

C(x, x') = σ² (2^{1−ν}/Γ(ν)) (|x − x'|/ℓ)^ν K_ν(|x − x'|/ℓ),   (4)

where ν > 0 is the smoothness parameter, and K_ν is the modified Bessel function of the second kind of order ν. The parameter ℓ is called the length-scaling. The correlation length, that is to say the distance at which the correlation is approximately 0.1, corresponds approximately to δ = ℓ√(8ν). From now on, we suppose that the Matérn fields have zero mean. Let us recall the stochastic partial differential equation for the Matérn fields [16,21]. The Fourier transform of the covariance function in Equation (4) gives the power spectrum

S(ξ) ∝ (ℓ^{−2} + |ξ|²)^{−(ν + d/2)}.
As first mentioned by Rozanov 1977 [23], only fields with spectral density given by the reciprocal of a polynomial have a Markov representation.For our applications, we fix ν = 2 − d/2.
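For d = 1, the choice ν = 2 − d/2 = 3/2 gives a closed-form correlation, since K_{3/2} is an elementary function; with the length-scaling convention used here, the correlation reduces to (1 + r/ℓ) exp(−r/ℓ). A minimal sketch (plain Python, names illustrative) checking that the correlation at the distance δ = ℓ√(8ν) is indeed roughly 0.1:

```python
import math

def matern_32_correlation(r, ell):
    """Matern correlation for smoothness nu = 3/2 (closed form of
    2^(1-nu)/Gamma(nu) * s^nu * K_nu(s) with s = r/ell):
    rho(r) = (1 + r/ell) * exp(-r/ell)."""
    s = r / ell
    return (1.0 + s) * math.exp(-s)

# Correlation length delta = ell * sqrt(8 * nu); correlation there is ~0.14,
# i.e. of the order 0.1 as stated in the text.
nu, ell = 1.5, 2.0
delta = ell * math.sqrt(8.0 * nu)
print(matern_32_correlation(delta, ell))
```

At r → 0 the correlation tends to 1, and it decays monotonically, as a correlation function should.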
If we let w be white noise and use the 'hat'-notation for a Fourier-transformed object, we may define the basic Matérn field v through the equation v̂(ξ) = √(S(ξ)) ŵ(ξ) in the sense of distributions. By using inverse Fourier transforms, we may write a stochastic partial differential equation for v. Here we have an elliptic operator equation. We note that the constructed field v is isotropic, i.e. the field has constant correlation length-scaling in every coordinate direction.
We modify the isotropic formulation to be inhomogeneous by allowing a spatially variable length-scaling field ℓ(x), for which we write a stochastic partial differential equation

(ℓ(x)^{−2} − Δ) v(x) = q w(x),   (5)

for a suitable scaling constant q > 0. In order to have a well-defined elliptic equation, we require that inf_{x∈D} ℓ(x) > 0. The condition will be fulfilled with the help of an auxiliary transformation. Moreover, ℓ needs to be regular enough. We consider the cases where ℓ has L^∞(D)-sample paths. In addition, we consider the problems that arise for rougher sample paths. We will construct the ℓ(x)-model in the following sections.
Let us now consider the discretisation of Equation (5). It is well known that white noise can be seen as a distributional derivative of the Brownian sheet B. Moreover, the measurable linear functionals of white noise can be identified with stochastic integrals. Hence, it is natural to discretise white noise as

w_N(jh) ∼ N(0, h^{−d}),   (6)

where h is the discretisation step, which we choose to be the same along all coordinate directions.
The only remaining part of Equation (5) to discretise is the operator, for which we can use any standard finite difference method. Hence, we can write e.g. the one-dimensional discretisation of (5) as

ℓ_j^{−2} v_j − (v_{j+1} − 2v_j + v_{j−1})/h² = q w_N(jh),   (7)

where jh ∈ hZ is the discretisation lattice, ℓ_j := ℓ(jh) and v_j := v_N(jh). This model can be given as a matrix equation L(ℓ_N)v_N = w_N, where L(ℓ_N) is a symmetric sparse matrix.
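As a sketch of this construction (one-dimensional, periodic boundary, scaling constants omitted; all names are illustrative, not the paper's code), the matrix L(ℓ_N) can be assembled from the finite differences and a realisation drawn by solving L v_N = w_N:

```python
import numpy as np

def assemble_L(ell, h):
    """Finite-difference matrix for (1/ell(x)^2 - d^2/dx^2) on a
    periodic 1-D lattice with step h; ell holds the per-node length-scaling.
    Only the diagonal depends on ell, so the matrix stays symmetric."""
    n = len(ell)
    L = np.zeros((n, n))
    for j in range(n):
        L[j, j] = 1.0 / ell[j] ** 2 + 2.0 / h ** 2
        L[j, (j - 1) % n] = -1.0 / h ** 2   # periodic wrap-around
        L[j, (j + 1) % n] = -1.0 / h ** 2
    return L

rng = np.random.default_rng(0)
n, h = 200, 1.0 / 200
ell = np.full(n, 0.1)                    # constant length-scaling for the demo
w = rng.standard_normal(n) * h ** -0.5   # discretised white noise, variance h^-d
L = assemble_L(ell, h)
v = np.linalg.solve(L, w)                # one realisation of the Matern field
```

In practice L would be stored as a sparse banded matrix; a dense solve is used here only to keep the sketch short.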

Hypermodels
Let us consider modelling the length-scaling ℓ(x; u) with Gaussian random fields.
To achieve an elliptic equation, we apply a log-normal model

ℓ(x; u) = exp(u(x)),   (8)

where u is a Gaussian random field. For the discrete model on the lattice points x = jh, we set similarly ℓ_N(jh) = exp(u_N(jh)). For example, we may choose another Matérn field as a hypermodel for u. Hence, let u be a zero-mean Matérn field with constant length-scaling ℓ_0, and consider its discretisation u_N by (7). Then u_N has a covariance matrix C_N, and we write the discrete hypermodel as a zero-mean Gaussian density with covariance C_N.
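A minimal sketch of the layered construction (the log-normal link ℓ = exp(u) as above; the Gaussian random walk below is a simplified stand-in for the Matérn hyperprior, used only to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 200, 1.0 / 200
# Gaussian random-walk surrogate for the hyperprior field u
u = np.cumsum(rng.standard_normal(n)) * np.sqrt(h)
# Log-normal link: ell > 0 everywhere, so the SPDE stays elliptic
ell = np.exp(u)
```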

Discretisation and convergence of the hypermodel
Let us now choose a Matérn field hyperprior for u as well as a Matérn prior for v, and use Equation (8) for the length-scaling ℓ. In this section, we first consider the discretisation and convergence of this hypermodel, and then make some notes on the case where we modify the hypermodel to have a Cauchy walk hyperprior.
As above, let the realisations of the random field u be tempered distributions that satisfy

(ℓ_0^{−2} − Δ) u = σ_0 w̃,   (9)

where w̃ is S'(R^d)-valued Gaussian white noise on R^d, and ℓ_0, σ_0 > 0 are given constants. We assume that w̃ is statistically independent of w.
It is easy to verify that u is a measurable transformation of its driving white noise, and that it has almost surely continuous sample paths for d = 1, 2. Hence the right-hand side of (5) is a well-defined generalised random field with a distribution on the Borel fields A (with respect to the weak*-topology) of S'(R^d).
In the first step, we approximate the random field v without approximating u or ℓ. We discretise the Laplacian with finite differences in (5) and, furthermore, discretise the white noise w.
Let us define a suitable mesh space in which the convergence is studied. We equip the space of all real-valued functions v_N on the mesh D_h = hZ^d ∩ D with a scaled Euclidean inner product, and denote the resulting norm by the L²(D_h)-norm.
We will use the following notation: we denote by B a suitable boundary operator that stands e.g. for the periodic or the Dirichlet boundary condition. In the next lemma, w_N = T_h w is the Steklov mollified version of the white noise; that is, the radonifying transformation T_h averages w locally over the mesh cells.

Lemma 4.1. Let u be a Gaussian random field that satisfies (9) with the periodic boundary condition, let ℓ(x; u) = g(u(x)), and let v_N solve the discretised equation on hZ^d ∩ D with the boundary condition Bv_N = 0 on hZ^d ∩ ∂D. Then the L²(L²(D_h), P)-norm of v_N − v converges to zero as h → 0.
Proof. We condition on u inside the L²(L²(D_h), P)-norm. Recall that u is a radonifying transformation of its driving white noise, which is statistically independent of w_N. Then also u and w_N are statistically independent. Moreover, v_N is a Carathéodory function of (u, w_N) (which is radonifying with respect to the second variable). This means that conditioning v_N on u = u_0, where u_0 ∈ C(D), only replaces the random coefficient ℓ(·; u) in (11) with a fixed continuous function ℓ(·; u_0). The same holds for v_N, w_N replaced with v, w.
Let us denote by v(·; u, f) and v_N(·; u, f) the solutions of (10) and (11), respectively, when the white noise load w is replaced with a function f ∈ L²(D).
By applying adjoints of the solution operators, this is easy to verify due to the linearity of the elliptic problem.
By the usual convergence results for the finite-difference scheme (see p. 214 in [12], with straightforward changes for the periodic case), we obtain the convergence for fixed u. By inserting the elliptic estimates (see Lemma 4.2 below) into (12), we get an upper bound with a constant C_u ∈ L²(P).
Remark 1. The above lemma can easily be generalised to the case where u has almost surely bounded sample paths and g is bounded from above and below by positive constants. Hence, we obtain similarly the convergence for the Cauchy walk.
For completeness, we recall the following elliptic estimate.
Lemma 4.2. The solutions of (10) satisfy an elliptic estimate, where the constant C ∈ L^p(P) for all p ≥ 1.
Proof. Take σ_0 = 1 for simplicity, and consider the equation with periodic boundary conditions. Let us write a corresponding integral equation with the help of a suitable resolvent operator. With Fourier techniques on the torus, we can show the required mapping properties. Moreover, the norm of the mapping is bounded by the maximum of 1 and c_0^{−1}. The claim then follows by the Babuška-Lax-Milgram theorem. The multipliers of the norm belong to L^p(P) for all p ≥ 1 [7].
Next, Equation (9) is also discretised on the equidistant mesh hZ^d by finite differences and the discretisation (6) of the white noise. We further modify the equation at the lattice points hk by expressing the discrete white noise w_N as the measurable transformation w_N(hk) = (T_h w)(hk) of the continuous-parameter white noise (for details on measurable transformations, see [2]).
Theorem 4.3. The claim of Lemma 4.1 holds also when u is replaced with its discretisation u_N, with the periodic boundary condition and ℓ(x; u) = g(u(x)).

Proof. Consider the norm of v_N − v split into two terms, where v(·; u_N) solves (13) for some continuous pointwise-convergent interpolation of u_N in place of u. By Lemma 4.1, the first term of (14) vanishes when h → 0. We show that the second term of (14) vanishes as h → 0. Indeed, if we can show that v(x; u_N, f) − v(x; u, f) converges to zero uniformly with respect to x and f, we are done. The term can be considered with the help of Sobolev spaces. We denote the Green's operator for ℓ(x; u)^{−2} − Δ by G_{ℓ(·;u)}, and similarly for ℓ(x; u_N)^{−2} − Δ.

By using integral equation techniques, we can show that the norms of these Green's operators are uniformly bounded with respect to N, and that the upper bound belongs to L²(P). The uniform boundedness of sup_{x∈D} ℓ(x; u_N) follows from Sobolev space estimates and the convergence of u_N to u in discrete Sobolev norms. Hence, the second term converges. The first term converges similarly.

Remark 2. We can show that the Steklov mollification can be replaced with the weaker mollification (6), with a similar technique as in the proof of Theorem 4.3.
Remark 3. Theorem 4.3 holds also for the Cauchy walk, when ℓ is uniformly bounded from above and below.

Rough hypermodels
In addition to the models mentioned above, we may wish to use a Cauchy noise, a Cauchy walk, or white noise, which have rougher sample paths than the previously discussed examples. For the discrete white noise, see Equation (6).
Let us consider a one-dimensional Cauchy walk and its discretisation [17]. The Cauchy walk u(x) is an α-stable Lévy motion with α = 1, defined through an independently scattered random measure M, where M(A) has a Cauchy probability density function for all Borel sets A ⊂ R_+. We denote u(x) ∼ Cauchy(|x|, 0). The Cauchy walk is right-continuous and has finite left limits. The discretisation of the Cauchy walk is based on independent increments u(hj) − u(h(j − 1)) ∼ Cauchy(h, 0). We note that ℓ(x; u) is then not stationary. To control the length-scaling in the elliptic equation (5), we make a transformation ℓ(x; u) = g(u(x)), where g is a bounded positive transformation (Equation (15)) in which d > 0 is a fixed small constant and a, b, c > 0 are suitably chosen constants. The corresponding discretised hypermodel can then be constructed as a density over ℓ_N. We note again that we have included the normalisation constant, as the L(ℓ_N)-matrix depends on the length-scaling ℓ_N, i.e. it is not a constant. Similarly, we could use discrete Cauchy noise, which is an iid process with rougher sample paths. For a discussion of Cauchy noise in Bayesian statistical inverse problems, see Sullivan 2016 [26]. Using again Equation (15), the discrete Cauchy noise u_N typically has values around zero, or 'sporadically' some big values. Hence, ℓ_N is typically either long or 'sporadically' short, i.e. the chosen hypermodel promotes either long or short length-scaling. We note that the transformation (15) is constructed to be symmetric with respect to zero.
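The Cauchy-walk hyperprior can be sketched as follows. The specific bounded map g below (a tanh-based map into [ℓ_min, ℓ_max]) is an illustrative stand-in for the paper's transformation (15), chosen only to reproduce its qualitative behaviour: symmetric in u, long length-scaling near u = 0, short length-scaling for sporadic large |u|.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 400, 1.0 / 400
# Cauchy walk: cumulative sum of independent Cauchy(h, 0) increments
u = np.cumsum(h * rng.standard_cauchy(n))

def g(u, ell_min=0.02, ell_max=1.0):
    """Illustrative bounded transformation: maps u symmetrically (in |u|)
    into [ell_min, ell_max]; u = 0 gives the longest length-scaling."""
    return ell_max - (ell_max - ell_min) * np.tanh(np.abs(u))

ell = g(u)
```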

Realisations
In Figure 1, we have plotted realisations of the Cauchy and Gaussian process hyperparameters ℓ_N^ω, the resulting covariance matrices, and realisations of v_N. The realisations v_N^ω clearly have non-stationary features: parts with highly oscillatory features and edges, and then smoother parts. In the Bayesian inversion analysis itself, i.e. in the posterior distribution, ℓ_N is a parameter vector to be estimated, and hence we can find where the non-stationarities, like the edges, are located.
In Figure 2, we have realisations of the constructed two-dimensional hypermodels. As hyperprior realisations, we have a constant-parameter Matérn field realisation, and an inhomogeneous Matérn field realisation obtained by using different values of length-scaling and tilt-angle in different parts of the domain.
Here we have used an extended domain when constructing the realisations and, in order to remove the boundary effects, we have cropped the images suitably. By varying the Matérn field models, we have three different length-scaling fields. These fields are then used as input for g(s), and in the bottom row, we have plotted realisations of the prior. In the bottom panel, second image from the right, we have used ℓ_1(x) = 2ℓ_2(x) and a non-zero constant tilt-angle θ. In this way, we can also create anisotropic features.

MCMC
In order to draw estimates from the posterior distribution, we use a combination of Gibbs sampling and Metropolis-within-Gibbs algorithms. We summarise the algorithm as follows:

1. Initiate v_N^(0) and ℓ_N^(0).

2. For k = 1, ..., K:

(a) Update v_N^(k): given fixed ℓ_N^(k−1), draw η ∼ N(0, I), and set v_N^(k) by solving the stacked matrix equation in the least-squares sense, where † denotes the matrix pseudoinverse.

(b) Update ℓ_N^(k) component-wise: (i) propose a new value for the n-th component with a random-walk proposal, and accept it with the Metropolis-Hastings acceptance probability.

Metropolis-within-Gibbs is explained for example in [8,17]. The latter uses the term single-component Metropolis-Hastings to emphasise the fact that we sample every single component separately with the Metropolis-Hastings algorithm. We aim at an acceptance ratio between 25 and 50 per cent, which is obtained by tuning the random-walk proposal process.
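The component-wise Metropolis-Hastings step (b) can be sketched as follows. This is a minimal self-contained sketch on an illustrative stand-in log-target, not the paper's conditional posterior of ℓ_N; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):
    """Stand-in log-target (standard Gaussian) playing the role of the
    conditional log-posterior of the hyperparameters."""
    return -0.5 * np.sum(theta ** 2)

def metropolis_within_gibbs(theta0, n_iter=2000, step=0.7):
    """Single-component Metropolis-Hastings: each coordinate is proposed
    and accepted/rejected separately with a random-walk proposal."""
    theta = theta0.copy()
    chain = np.empty((n_iter, theta.size))
    accepted = 0
    for k in range(n_iter):
        for n in range(theta.size):
            prop = theta.copy()
            prop[n] += step * rng.standard_normal()
            if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
                theta = prop
                accepted += 1
        chain[k] = theta
    return chain, accepted / (n_iter * theta.size)

chain, acc = metropolis_within_gibbs(np.zeros(3))
```

The proposal step size plays the same role as the random-walk tuning in the text: it is adjusted until the acceptance ratio falls in the desired range.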
In computing the acceptance probability, it is common practice, for numerical reasons, to take logarithms instead of using ratios. For example, in the case of the Gaussian hyperprior, the logarithm of the posterior contains a constant R, which we may omit from the analysis. We note that the normalisation constant computation, i.e. the logarithmic determinant log(|L(ℓ_N)|), is a numerically unstable and computationally expensive operation, especially in higher dimensions. We need to compute altogether N × K logarithmic determinants in our estimation algorithm, hence we may wish to minimise the log-determinant computation time. We note that in the Metropolis-Hastings part, when updating ℓ_N^(k) component-wise, we are actually computing the ratio of the proposed and old normalisation constants. Let us denote the old and proposed covariances by C_old and C_prop, respectively. Then we should actually calculate the ratio, as originally in Equation (16), and not take logarithms. By simple algebra, the ratio reduces to a determinant of the product L_prop L_old^{−1} (up to transposes). Now, we note that as we update only one row at a time, this matrix product is diagonal except for the n-th updated row. This means that we simply need to compute the diagonal value of the updated row, i.e. only one value. It would seem that we would need to invert the whole matrix L_old. However, the L-matrices are sparse, so we have only a limited number of non-zero values. Consider e.g. the one-dimensional case. We could have e.g.
a sparse (tridiagonal) matrix L_old, whose entries involve constants a and b derived from the approximations in Equation (7). By simple matrix operations and removing the 'zeroes' from the matrix equation, we can rewrite this as a small system in which L̃_old is a 3 × 3 matrix. Hence, the computation of the determinant ratio reduces to inverting a 3 × 3 matrix and making one matrix-vector multiplication.
In two-dimensional problems, we need to invert a 5 × 5 matrix.
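The numerical point about log-determinants can be illustrated as follows: for a symmetric positive definite matrix, the log-determinant is computed stably from a Cholesky factor, whereas the plain determinant easily overflows (a generic sketch, not the paper's code):

```python
import numpy as np

def logdet_chol(A):
    """Stable log-determinant of an SPD matrix via Cholesky:
    log|A| = 2 * sum(log(diag(chol(A))))."""
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(A))))

# A 500 x 500 SPD matrix whose plain determinant, 5^500, overflows float64
A = 5.0 * np.eye(500)
naive = np.linalg.det(A)   # overflows to inf
stable = logdet_chol(A)    # 500 * log(5), computed without overflow
```

For the sparse precision matrices of this paper, one would use a sparse Cholesky factorisation instead of the dense one shown here, and the ratio trick above avoids recomputing even that.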

Numerical examples
We now apply the developed methodology to one-dimensional interpolation and numerical differentiation, and to two-dimensional interpolation.

One-dimensional interpolation
We model discrete noise-perturbed observations of a continuous object v as y(j̃h̃) = v(j̃h̃) + e(j̃h̃), where e(j̃h̃) is zero-mean white noise with known variance, j̃ ∈ J̃ ⊂ Z is the measurement mesh, and j ∈ J ⊂ Z is the mesh of the discretised unknown v_N. The discretisation steps satisfy h̃, h > 0. This model can be rewritten in the form given in Equation (3), i.e. as y = Av_N + e. Hence, we can write the whole posterior distribution with the help of the hypermodels, as discussed earlier.
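In this interpolation setting, A simply restricts the unknown to the coarser measurement mesh. A minimal sketch, assuming h̃ = 2h as in the example below (mesh sizes taken from that example; the construction itself is generic):

```python
import numpy as np

n_unknown = 161                          # unknown mesh, h = 1/16 on [0, 10]
meas_idx = np.arange(0, n_unknown, 2)    # every second node -> 81 measurements
A = np.zeros((meas_idx.size, n_unknown))
A[np.arange(meas_idx.size), meas_idx] = 1.0   # row j picks node meas_idx[j]

v = np.linspace(0.0, 10.0, n_unknown) ** 2    # any test vector on the fine mesh
y_clean = A @ v                               # v restricted to the coarse mesh
```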
Let us consider v consisting of a C^∞ mollifier function and two boxcar functions (Equation (17)). This function has smooth parts and edges, and it is also piecewise constant for x ∈ [5, 10]. In Figure 3, we have simulation results with three different prior and hypermodel combinations:

1. Constant-parameter Matérn prior, i.e. this model does not have a hyperprior.

2. Hypermodel with a Cauchy walk process u.

3. Hypermodel with a stationary Gaussian zero-mean process u with exponential covariance.
The domain for both the measurements y and the unknown v_N is [0, 10]. The measurement mesh is j̃ = {0, 1, ..., 80}, h̃ = 1/8, and the unknown mesh is j = {0, 1, ..., 160}, h = 1/16. The zero-mean measurement noise has standard deviation σ = 0.1. The Matérn prior has periodic boundary conditions. With the constant-parameter Matérn prior, we have plotted estimates with a long length-scaling (D), the length-scaling minimising the maximum absolute error (G), and the length-scaling minimising the root mean square error (J). These estimates capture the smoothness or the edges, but not both at the same time. With the Cauchy and Gaussian hypermodels, the algorithm finds both short and long length-scalings ℓ_N (subfigures (B) and (C)). Also, the corresponding v_N estimates in (E) and (F) show that we can reconstruct both smooth and edge-preserving parts. In subfigures (H) and (I), we have plotted v_N on the measurement mesh, and in subfigures (K) and (L) at the interpolated points, i.e. between the measurement grid points. This shows that the interpolated estimates behave as expected.
In order to relate this study to Paciorek's 2003 study [19], we note that the (D,G,J) subfigures correspond to Paciorek's so-called single-layer model. For deep learning, one would build a series of hyperpriors over hyperpriors. Here, instead, we have a two-layer model, and we note that it captures different properties with very good precision. Hence, the question remains whether two layers are actually often enough in deep Gaussian processes, and what the actual gain of using deeper layers is. We leave this question open, and hope to address it in subsequent studies.
In Figures 4 and 5, we study the behaviour of ℓ_N and v_N when N changes, i.e. we numerically study discretisation-invariance. We choose N = 81, 161, 321. The Cauchy and Gaussian hypermodels, as well as the forward theory, are the same as in the example in Figure 3. As we have constructed the hypermodels based on continuous-parameter processes, we expect the finite-dimensional estimates to look essentially the same. This was covered theoretically in Section 4. This behaviour can be verified visually rather easily from all the v_N estimates, but the ℓ_N estimates are not as well-behaved. The reason is mostly too short chains, but already with the chains used here, with K = 100,000, the essential features are in practice rather similar.
We have also plotted the MCMC chains and cumulative means for the 15th and 66th elements of ℓ_N and v_N with N = 81. The elements are chosen in such a way that the 15th element is on a smoothly varying part of the unknown, hence long length-scaling. Around the 66th element, we expect to detect an edge, hence short length-scaling. The chains show both good mixing and convergence to the expected values. We use 50,000 samples as the burn-in period. As can be seen from the figures, we could use much shorter burn-in periods and chains, but here our objective is simply to demonstrate the estimators, not to optimise the chains. Hence, we leave the optimisation of the MCMC chains for future studies.

Multimodal posterior densities
Let us now consider the posterior densities of v_N(jh). If we use the Gaussian hypermodel with the exponential transformation of u_N (Equation (8)), it is easy to show numerically that the posterior densities of v_N are Gaussian. It would be tempting to also model the longest and shortest length-scalings, i.e. a priori lower and upper bounds for ℓ_N. One such model could be given by a transformation with a max-min cutoff (Equation (18)), where a > 0, b ∈ (0, 1) are some constants, and P_upper > P_lower > 0. It is convenient to model the lower bound as ℓ(0) = 1 − b = P_lower. However, using this model leads to multimodal posterior densities due to the max-min cutoff, especially at the jumps.
In Figure 6, we have considered the same Gaussian hypermodel and interpolation problem as in Figure 3. Given the MCMC chains, we compute kernel density estimates of the posterior densities at the jump at x ≈ 8. This corresponds to grid elements j = 129, 130, 131, where we have a jump from +1 to −1. The densities at grid elements 129 and 131 are Gaussian. However, at grid element 130, the density is trimodal. We have also plotted the Gaussian density estimate with a dashed line, and it clearly fails to capture the trimodality. The reason for the trimodal density is, presumably, the non-linear transformation of Equation (18). Hence, the algorithm does detect edges, but we need to be careful when assessing the multimodality of the posterior densities.

Numerical differentiation
As a second numerical example, we consider numerical differentiation of noisy data. The continuous forward model is given by a first-kind Fredholm integral equation

y(x) = ∫ H(x − x') v(x') dx' + e(x),

where the convolving kernel H is a Heaviside step function. If we had a deterministic equation (i.e. no noise term), then v = y' would be the differentiated function. Hence, we can formulate differentiation as a linear Bayesian inverse problem with observations given as y = Av_N + e. For numerical tests, we choose an unknown consisting of a mollifier and a triangle function. The derivative of a triangle function is a piecewise constant function, i.e. the same two boxcar functions as in Equation (17). The differentiated mollifier is again a smoothly varying function. We have chosen the mollifier constants in such a way that the differentiated mollifier stays in the same range as the mollifier, simply to avoid visualisation problems in scaling. The unknown function is then

v(x) = (25/(x²(5 − x)) − 25/(x(5 − x)²)) exp(4 − 25/(x(5 − x))) for x ∈ (0, 5),  1 for x ∈ [7, 8],  −1 for x ∈ (8, 9],  and 0 otherwise.
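With the Heaviside kernel, y(x) = ∫₀ˣ v(t) dt, so the discretised forward operator A is a lower-triangular cumulative-integration matrix. A minimal sketch (midpoint mesh and scaling are illustrative choices, not the paper's exact discretisation):

```python
import numpy as np

n, h = 500, 5.0 / 500
x = (np.arange(n) + 0.5) * h             # midpoint mesh on (0, 5)
# Heaviside kernel: y(x) = int_0^x v(t) dt -> lower-triangular Riemann sum
A = h * np.tril(np.ones((n, n)))

v = np.cos(x)                            # a known "derivative" as the unknown
y = A @ v                                # noiseless data, approximately sin(x)
```

Here y[j] approximates sin((j+1)h) to midpoint-rule accuracy, confirming that A integrates and hence that inverting it amounts to differentiation.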
In Figure 7, we have numerical derivatives v_N on three different meshes, as well as the Gaussian hyperprior process ℓ_N. In the simulations, we have 101 observations y with measurement noise σ = 0.03. We note that as numerical differentiation is an ill-posed problem, we cannot use as high noise levels as in the interpolation examples. However, with the used noise levels, the algorithm finds the edges as well as the smooth structures.

Two-dimensional interpolation
Similarly to the one-dimensional interpolation examples with Cauchy and Gaussian hypermodels, we can make two-dimensional interpolations. In Figure 8, we show interpolation of noisy observations, originally given on a 41 × 41 mesh.

Conclusion and discussion
We have considered the construction of hypermodels which promote both smoothness and rapidly oscillating features. The methodology is based on constructing Cauchy and Gaussian hypermodels for the Matérn field length-scaling ℓ_N. We constructed a combined Gibbs and Metropolis-within-Gibbs algorithm for computing estimates of the unknown and the length-scaling, respectively. In addition, we have shown both analytically and numerically the discretisation-invariance of the estimates. The estimates provide significant advances in comparison to standard constant-parameter Matérn field priors, as we can detect more versatile features. In this study, we did not include all the Matérn field parameters in the hyperprior. In future studies, e.g. in two-dimensional problems, we should have hypermodel fields for ℓ_1, ℓ_2 and θ, in addition to the variance-scaling mask σ². We consider this paper to be a concept paper, and hence we have considered simple inversion examples. However, the methodology can be applied to e.g. electrical impedance tomography, Darcy flow models and X-ray tomography. In addition, implementing spatiotemporal models with infinite-dimensional Kalman filter techniques would be an interesting path forward. On the more theoretical side, we should study the discretisation-invariance issues more rigorously. Also, the computational machinery needs to be developed further, for example by parallelising the MCMC algorithms, i.e. the Metropolis-within-Gibbs sampler can be run on multicore computers. Utilisation of GPUs would also be of interest.
Realisations v_N^ω given the Gaussian process hyperparameter u_N^ω.

Figure 1: Examples of constructing non-stationary Matérn realisations with hypermodels. Top panel, from left to right: realisation ℓ_N^ω given a Cauchy walk as u_N^ω, the resulting covariance matrix, and four realisations. Bottom panel: same as above, but with a Gaussian process hyperprior.
(ii) Set n ← n + 1, and repeat from step (i) until n = N. (c) Set k ← k + 1, and repeat from step (a) until the desired sample size K is reached.

Figure 3: Top panel: 81 noisy measurements and estimated ℓ_N (N = 161) with the Cauchy hypermodel (B) and the Gaussian hyperprior (C). (D,G,J) are conditional mean estimates of v_N (N = 161) with long length-scaling (D), ℓ_N minimising MAE (G), and ℓ_N minimising RMSE (J). (E,H,K) and (F,I,L) are CM estimates of v_N on different meshes with the Cauchy and Gaussian hypermodels, respectively.

Figure 4: Estimates of ℓ_N and v_N with a Cauchy walk hypermodel u_N on different lattices with 81 measurements, with the number of unknowns varying as in the figures. The bottom four subfigures (G-J) are chains and cumulative means of certain ℓ_N and v_N elements.

Figure 5: Estimates of ℓ_N and v_N with a Gaussian hyperprior u_N on different lattices with 81 measurements, with the number of unknowns varying as in the figures. The bottom four subfigures (G-J) are chains and cumulative means of certain ℓ_N and v_N elements.

Figure 6: Multimodality of the posterior densities of v_N at the edge around x ≈ 8, when using the Gaussian hyperprior and Equation (18).

Figure 7: Numerical differentiation of a noisy signal with the developed Gaussian hypermodel. We plot v_N on different meshes to show the discretisation-invariance of the estimates.

Figure 8: Two-dimensional interpolation of block-shaped and Gaussian-shaped structures from noisy observations.A Matérn hyperprior is used in the analysis.