Certified and fast computations with shallow covariance kernels

Abstract. Many techniques for data science and uncertainty quantification demand efficient tools to handle Gaussian random fields, which are defined in terms of their mean functions and covariance operators. Recently, parameterized Gaussian random fields have gained increased attention, due to their higher degree of flexibility. However, especially if the random field is parameterized through its covariance operator, classical random field discretization techniques fail or become inefficient. In this work we introduce and analyze a new and certified algorithm for the low-rank approximation of a parameterized family of covariance operators, which represents an extension of the adaptive cross approximation method for symmetric positive definite matrices. The algorithm relies on an affine linear expansion of the covariance operator with respect to the parameters, which needs to be computed in a preprocessing step using, e.g., the empirical interpolation method. We discuss and test our new approach for isotropic covariance kernels, such as Matérn kernels. The numerical results demonstrate the advantages of our approach in terms of computational time and confirm that the proposed algorithm provides the basis of a fast sampling procedure for parameter dependent Gaussian random fields.

1. Introduction. Deep neural networks (DNNs) play a fundamental role in modern machine learning and artificial intelligence; see [24] for an overview. They are an example of a deep model, which is constructed by composing a number of mathematical models ('layers'). By this construction, deep models allow for much more flexibility in data-driven applications than offered by a standard model. The layers in deep models are often rather simple, like artificial neurons in DNNs or linear functions. Hence, the computational cost when training a DNN is caused by the large number of layers, not by the complexity of the single models. In other situations, however, the layers themselves are highly complex.
Here, already the training of a shallow model, consisting of only a few layers, may come at an immense computational cost.
In this article, we consider a specific shallow model with two layers. The outer layer is a Gaussian process on a compact domain with a covariance kernel containing unknown parameters. The inner layer is a probability measure that describes the distribution of the unknown parameters in the covariance kernel. We refer to such a shallow model as a shallow Gaussian process, in contrast to deep Gaussian processes that recently gained popularity in the literature; see [7, 9, 11]. Shallow Gaussian processes appear especially in the uncertainty quantification literature, see [10, 28, 31, 37, 44], where they are used to model random, function-valued inputs to partial differential equations. Additionally, they are applicable in a variety of statistical or machine learning tasks; see, e.g., [21] and Chapter 5 of [33].
Figure 1: Illustration of a shallow Gaussian process. On the inner layer, a random variable θ with law µ′ is sampled. On the outer layer, the process X ∼ M(·|θ) is sampled. Then, X follows the distribution of the shallow Gaussian process µ.
In practice, the Gaussian process on the outer layer is discretized with a very large number of degrees of freedom n ∈ N. The discretization induces a parameterized Gaussian measure on (R^n, B(R^n)) of the form M(·|θ) := N(m(θ), C(θ)), θ ∈ Θ.
Here θ refers to the unknown parameters that are contained in the parameter space Θ, which is associated with some σ-algebra A. Moreover, m : Θ → R^n is a measurable function and C : Θ → SPSD(n) is a measurable function mapping parameters θ ∈ Θ to the set of symmetric positive semidefinite (SPSD) matrices in R^{n×n}. By these assumptions, M is a Markov kernel mapping from (Θ, A) to (R^n, B(R^n)). The inner layer is represented by a probability measure µ′ on (Θ, A). The shallow Gaussian process is then defined as the composition of µ′ and M, µ := µ′M, i.e. µ(B) = ∫_Θ M(B|θ) dµ′(θ) for B ∈ B(R^n), which is a well-defined probability measure on (R^n, B(R^n)). In Figure 1, we visualize the simple graph structure behind the shallow Gaussian process µ. We refer to a measure µ of the described form as a hierarchical measure.
To employ µ in a statistical or learning task, we have to sample from M(·|θ) for a large number of parameter samples θ ∈ Θ. In general, sampling requires a Cholesky factorization or spectral decomposition of C(θ); see, e.g., [17, p. 1873]. We assume here that C(θ) is a dense matrix. Then, sampling from M(·|θ) in the standard way comes at a cost of O(n³) operations for each θ ∈ Θ. Since n is assumed to be very large, performing these tasks for many θ ∈ Θ becomes computationally infeasible.
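A minimal sketch of this standard, cubic-cost sampling step (the function name and the toy covariance are illustrative, not taken from the paper):

```python
import numpy as np

def sample_gaussian(m, C, rng):
    """Draw one sample from N(m, C) via a dense Cholesky factorization.

    The factorization costs O(n^3) and has to be repeated for every new
    parameter value theta, which is the bottleneck addressed in this work.
    """
    L = np.linalg.cholesky(C)          # C = L L^T, O(n^3) operations
    xi = rng.standard_normal(len(m))   # xi ~ N(0, Id_n)
    return m + L @ xi                  # m + L xi ~ N(m, C)

# Toy example: a generic dense SPD covariance of size n = 4.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)
x = sample_gaussian(np.zeros(n), C, rng)
```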
The main contribution of this article is a new variant, parameter-dependent ACA, that aims at approximating C(·) simultaneously across the whole parameter range. Indeed, we replace C(·) by some low-rank approximation C̃(·) for which sampling can be done computationally fast, even if n is large. We define the corresponding approximate outer layer Markov kernel M̃(·|θ) := N(m(θ), C̃(θ)) and shallow Gaussian process µ̃ := µ′M̃. In our method, C̃(·) is constructed by a greedy procedure that controls the errors in the Wasserstein distance W₂. That means, we aim at an approximation that is guaranteed to be accurate in terms of the Wasserstein distances

(1.1)  W₂(M(·|θ), M̃(·|θ)), θ ∈ Θ,  and  W₂(µ, µ̃).

This is a significant improvement compared to the heuristic algorithm proposed by Latz et al. [28]. Such approximation methods are called certified. Further contributions of this work are the following:
• We propose and discuss techniques to approximate parameterized covariance matrices by linearly separable expansions, as required by ACA and other low-rank techniques,
• we analyze the robustness of the new parameter-dependent ACA with respect to the loss of positive semi-definiteness, and we demonstrate that it has linear cost O(n),
• we show computational experiments in which the techniques are used to construct low-rank approximations of matrices generated from Matérn covariance kernels; see, e.g., [38, Subsection 2.10].
This work is organized as follows. In Section 2, we compute upper bounds for the Wasserstein distances in (1.1). These will be used for error control in the parameter-dependent ACA that we propose in Section 3. In Section 4, we give examples for isotropic covariance kernels and discuss approximation strategies that lead to linearly separable covariance kernels. Finally, we verify our theoretical findings in numerical experiments in Section 5 and conclude the work in Section 6.

2. Error analysis.
As discussed in Section 1, the approximations µ̃, M̃ of µ, M are obtained by replacing the parameterized covariance matrices by low-rank approximations. In this section, we quantify the error that is introduced by such an approximation.
Various distances for measures have been proposed, such as the total variation distance, the Hellinger distance, the Prokhorov metric, the Kullback-Leibler divergence, and the Wasserstein distance; we refer to Gibbs and Su [15] for an overview. Let θ ∈ Θ and let C̃(θ) be a rank-deficient approximation of C(θ). In this case, M(·|θ) and M̃(·|θ) are singular measures, which makes the total variation distance and equivalent metrics unsuitable: for singular measures the total variation distance is always equal to 2. The Wasserstein distance W_p, however, is suitable, as it is closely related to weak convergence; see [43, Theorem 6.9].

Definition 2.1 (Wasserstein). Let ν, ν̃ be two probability measures on (R^n, B(R^n)). We define Coup(ν, ν̃) to be the set of couplings of ν, ν̃, i.e. the set of probability measures H on (R^{2n}, B(R^{2n})) such that H(B × R^n) = ν(B) and H(R^n × B) = ν̃(B) for all B ∈ B(R^n). For p ∈ [1, ∞), the Wasserstein(-p) distance between ν and ν̃ is defined by

W_p(ν, ν̃) := ( inf_{H ∈ Coup(ν, ν̃)} ∫ ‖x − y‖_p^p dH(x, y) )^{1/p},

provided that this integral is well-defined.
The Wasserstein distance W_p is motivated by the idea of optimal transport. It is defined as the cost of an optimal transport between the two probability measures, measured in the p-norm. The distance is well-defined if the p-th absolute moment of both measures is finite. This is for instance the case for measures with light tails, such as the Gaussian measures that we encounter throughout Subsection 2.1. For more details on the Wasserstein distance, we refer to the book by Villani [43].
Throughout the rest of this paper, we assume, without loss of generality, that the parameterized Gaussian measures have mean zero, i.e., m ≡ 0.
2.1. The fixed-parameter case. We commence with the analysis of the Wasserstein distance W₂ between two Gaussian measures for fixed θ ∈ Θ. A well-known result from [14] states that

(2.1)  W₂(N(0, C), N(0, C̃))² = trace( C + C̃ − 2 (C^{1/2} C̃ C^{1/2})^{1/2} ).

Throughout the remainder of this section, we assume that C − C̃ is SPSD. Note that this assumption can be written as C̃ ⪯ C, where ⪯ denotes the Löwner order on SPSD(n).
If this assumption holds, we can bound the Wasserstein distance W₂ between the measures κ and κ̃ in terms of the trace of C − C̃.

Lemma 2.2. Let κ := N(0, C) and κ̃ := N(0, C̃) be Gaussian measures on R^n and let C − C̃ be SPSD. Then W₂(κ, κ̃) ≤ trace(C − C̃)^{1/2}.

Proof. The set of couplings Coup(κ, κ̃) contains the following set of Gaussian measures on R^{2n}:

E := { N(0, G) : G = [C, Σ; Σᵀ, C̃] is SPSD },

for which ∫ ‖x − y‖₂² dH(x, y) = trace(C) + trace(C̃) − 2 trace(Σ). The result of the lemma now follows from choosing Σ = C̃, provided that this choice leads to a measure in E, that is, the matrix G = [C, C̃; C̃, C̃] is SPSD. This follows immediately from the fact that G can be written as a sum of two SPSD matrices:

G = [C̃, C̃; C̃, C̃] + [C − C̃, 0; 0, 0],

where C − C̃ is SPSD by assumption.
Using the exact formula (2.1) for the Wasserstein distance, the following example illustrates the tightness of the bound in Lemma 2.2.
Example 2.3. Let n = 100. In two settings, we consider a measure κ := N(0, C) and a sequence of measures κ̃_i := N(0, C̃_i), i = 1, ..., 100. (i) We consider C := Id₁₀₀, the identity matrix in R^{100×100}, and C̃_i to be the diagonal matrix with (C̃_i)_{jj} = 1 if j ≤ i, and (C̃_i)_{jj} = 0 otherwise. (ii) We consider C := (exp(−(j − k)²))_{1≤j,k≤100}, a discretization of the 1D Gaussian covariance kernel, and C̃_i := (i/100) · C. Figure 2 shows W₂(κ, κ̃_i) and its upper bound from Lemma 2.2.
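Setting (i) of the example can be checked numerically against formula (2.1). The following sketch (helper names are illustrative) computes the squared distance via an eigendecomposition-based matrix square root and compares it with the bound of Lemma 2.2; for this commuting pair both equal n − i:

```python
import numpy as np

def psd_sqrt(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2sq_gaussians(C, Ct):
    """Squared W2 distance between N(0, C) and N(0, Ct), formula (2.1)."""
    Ch = psd_sqrt(C)
    return np.trace(C) + np.trace(Ct) - 2.0 * np.trace(psd_sqrt(Ch @ Ct @ Ch))

# Setting (i) of Example 2.3 with i = 40: both the squared distance and
# the bound trace(C - Ct) of Lemma 2.2 equal n - i = 60.
n, i = 100, 40
C = np.eye(n)
Ct = np.diag((np.arange(n) < i).astype(float))
w2sq = w2sq_gaussians(C, Ct)
bound = np.trace(C - Ct)
```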
We observe that for Example 2.3(i), the bound from Lemma 2.2 coincides with the Wasserstein distance. This is not a coincidence, as the following corollary shows.
Corollary 2.4. Let κ := N(0, C) and κ̃ := N(0, C̃), where C − C̃ is SPSD. Additionally, suppose that C and C̃ commute and that C C̃ = C̃². Then W₂(κ, κ̃) = trace(C − C̃)^{1/2}.

Proof. Since C and C̃ commute, their square-roots commute as well, and hence (C^{1/2} C̃ C^{1/2})^{1/2} = (C C̃)^{1/2} = (C̃²)^{1/2} = C̃. It thus follows from (2.1) that

W₂(κ, κ̃)² = trace(C + C̃ − 2 C̃) = trace(C − C̃).

Taking the square-root on both sides proves our claim.
Remark 2.5. The assumptions of Corollary 2.4 are for instance satisfied if C̃ is given via a truncation of the spectral decomposition of C. Such a low-rank approximation arises when employing a principal component analysis (PCA) or Karhunen-Loève (KL) expansion of the Gaussian random vectors with covariance C. In particular, let V diag(e₁, ..., e_n) V^{−1} = C be the spectral decomposition of C and C̃ := V diag(ẽ₁, ..., ẽ_n) V^{−1}, where some of the eigenvalues of C are replaced by zeros, i.e. ẽ_i ∈ {e_i, 0}, i = 1, ..., n. The Wasserstein distance between mean-zero Gaussian measures with these two covariance matrices is then given by

W₂ = ( Σ_{i : ẽ_i = 0} e_i )^{1/2}.

Notably, this is identical to the L² error obtained from dropping terms in the PCA.
For general low-rank approximations C̃ of C, we cannot expect the assumptions of Corollary 2.4 to hold. Indeed, in Example 2.3(ii), the bound is not exact. However, it shows the same qualitative convergence behavior.
In practice, we cannot compute the integral on the right-hand side of (2.2). A straightforward alternative is a Monte Carlo approximation with samples from µ′. If the parameter space Θ is low-dimensional, using a deterministic quadrature rule would also be possible.
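A minimal sketch of such a Monte Carlo approximation (function names, the toy parameterization, and the uniform parameter law are illustrative assumptions, not from the paper):

```python
import numpy as np

def mc_trace_error(trace_gap, sample_theta, n_samples, rng):
    """Monte Carlo estimate of E_theta[ trace(C(theta) - Ct(theta)) ],
    which bounds the expected squared W2 error by Lemma 2.2."""
    vals = [trace_gap(sample_theta(rng)) for _ in range(n_samples)]
    return float(np.mean(vals))

# Toy check: C(theta) = theta * Id_3, Ct(theta) = 0, theta ~ U(0, 1).
# The trace gap is 3 * theta, whose expectation is 1.5.
rng = np.random.default_rng(1)
est = mc_trace_error(lambda t: 3.0 * t, lambda r: r.uniform(0.0, 1.0), 20000, rng)
```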
3. Low-rank approximation of the covariance operator.
3.1. Cross approximation. Cross approximation [3] is a technique to construct a low-rank approximation of a matrix from some of its rows and columns. More specifically, for a matrix A ∈ R^{n×n} let us consider the rows A(I, :) and columns A(:, J) corresponding to index sets I = {i₁, ..., i_k} and J = {j₁, ..., j_k}, respectively. Assuming that the cross matrix A(I, J) is invertible, the cross approximation of A associated with I and J is given by

(3.1)  A(:, J) A(I, J)^{−1} A(I, :).
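Formula (3.1) is easy to state in code. The following sketch (names are illustrative) builds the cross approximation for given index sets and checks that a rank-2 matrix is reproduced exactly from two rows and two columns:

```python
import numpy as np

def cross_approximation(A, I, J):
    """Cross approximation (3.1): A(:, J) A(I, J)^{-1} A(I, :)."""
    core = A[np.ix_(I, J)]                      # the cross matrix A(I, J)
    return A[:, J] @ np.linalg.solve(core, A[I, :])

# A rank-2 matrix is recovered exactly from 2 rows and 2 columns
# (for generic index sets, so that the cross matrix is invertible).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))
Ahat = cross_approximation(A, [0, 3], [1, 4])
err = np.linalg.norm(A - Ahat)
```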

3.2. Adaptive cross approximation for an SPSD matrix. The choices of k and of the index sets I, J are crucial for the accuracy of the cross approximation (3.1). The adaptive cross approximation algorithm (ACA) chooses I, J in an adaptive way via a greedy strategy, analogous to complete pivoting in Gaussian elimination. In the first step of the procedure, i₁, j₁ are chosen such that the pivot element A_{i₁,j₁} has maximal absolute value. Then the procedure is repeated with A replaced by the residual matrix A − A(:, j₁) A(i₁, j₁)^{−1} A(i₁, :), and so on. For an SPSD matrix A, the entry of maximal absolute value can always be found on the diagonal. As the residual matrix is again SPSD, this implies that ACA for SPSD matrices chooses index sets I, J such that I = J. This greatly reduces the cost of ACA because only the diagonal elements of the residual matrices need to be inspected in order to determine the pivot elements. Therefore, ACA has linear complexity with respect to n and returns an SPSD cross approximation of the form

A_I := A(:, I) A(I, I)^{−1} A(I, :).

As the residual matrix E_I := A − A_I is also SPSD, its trace equals its nuclear norm. This implies

(3.2)  ‖E_I‖_F ≤ trace(E_I).

In view of Lemma 2.2, if A is a covariance matrix then trace(E_I)^{1/2} is an upper bound for the Wasserstein distance W₂ between the Gaussian measures N(0, A) and N(0, A_I).
A bound of the form trace(E_I) ≤ c(n, k) σ_{k+1}(A), with a factor c(n, k) that grows exponentially in k = |I| and where σ_{k+1}(A) denotes the (k + 1)th singular value of A, has been proved in [13, Theorem 1]. This estimate is pessimistic, and in the typical situation ACA yields an approximation error that is much closer to σ_{k+1}(A). The adaptive cross approximation algorithm for SPSD matrices [19] is reported in Algorithm 3.1. To simplify the description, E_I is initialized to the complete matrix A. In a practical implementation, only the diagonal elements of A are used and line 7 is replaced by updating the diagonal entries of E_I only. In turn, only the diagonal entries of A and the columns corresponding to I need to be evaluated. The cost of k iterations of Algorithm 3.1 is O((k + c_A)kn), where c_A denotes the cost of evaluating an entry of A.
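A compact sketch of the diagonally pivoted procedure described above (for clarity the full residual matrix is updated, whereas, as noted in the text, a practical implementation only tracks diag(E_I) and the selected columns; the function name is illustrative):

```python
import numpy as np

def aca_spsd(A, tol):
    """ACA for an SPSD matrix with diagonal pivoting, in the spirit of
    Algorithm 3.1: stop once trace(E_I) <= tol."""
    E = A.astype(float).copy()   # residual matrix E_I, initialized to A
    I = []
    while np.trace(E) > tol:
        i = int(np.argmax(np.diag(E)))                  # pivot on the diagonal
        I.append(i)
        E = E - np.outer(E[:, i], E[i, :]) / E[i, i]    # rank-1 update
    return I

# An exactly rank-5 SPSD matrix is resolved after 5 greedy steps.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 5))
A = B @ B.T
I = aca_spsd(A, 1e-8)
```

By Lemma 2.2, the returned trace level certifies the Wasserstein error between N(0, A) and N(0, A_I).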
[Algorithm 3.1 (adaptive cross approximation for an SPSD matrix); listing omitted.]

3.3. Adaptive cross approximation for a parameter-dependent SPSD matrix. The approximation of a parameterized Gaussian random field requires the approximation of a parameterized family of covariance matrices; see Theorem 2.6. In this section we introduce an extension of ACA to the case of a parameterized family of matrices A(θ) : Θ → R^{n×n} such that A(θ) admits an affine linear expansion

(3.3)  A(θ) = Σ_{j=1}^s ϕ_j(θ) A_j,

and A(θ) is SPSD for each θ ∈ Θ. Again, we aim at designing a matrix-free method, so we assume to have an efficient way to extract entries from the matrices A₁, ..., A_s in place of forming them explicitly. Moreover, we replace the parameter space Θ with a finite surrogate set Θ_f = {θ₁, ..., θ_m} ⊂ Θ, e.g. a uniform sampling over Θ or, in the light of Theorem 2.6, samples from µ′. The goal of the algorithm is to provide a subset of indices I such that

(3.4)  A_I(θ) := A(θ)(:, I) A(θ)(I, I)^{−1} A(θ)(I, :)

is a uniformly good approximation of A(θ), i.e. trace(A(θ) − A_I(θ)) ≤ tol for any θ ∈ Θ. Following a popular technique in reduced basis methods [22], the core idea of the method is to augment the set I with a greedy strategy. First, the value

(3.5)  θ* := argmax_{θ ∈ Θ_f} trace(A(θ) − A_I(θ))

is identified. Then, one step of Algorithm 3.1 is applied to the matrix A(θ*) − A_I(θ*), returning the new index that is appended to I.
The most expensive part of the method sketched above is finding θ*, since it requires evaluating the objective function at every point in Θ_f. In order to amortize the cost of multiple evaluations of trace(A(θ) − A_I(θ)) = trace(A(θ)) − trace(A_I(θ)), we precompute once and for all t_j := trace(A_j), j = 1, ..., s, so that for every θ ∈ Θ_f, computing the quantity

trace(A(θ)) = Σ_{j=1}^s ϕ_j(θ) t_j

requires 2s − 1 flops and s evaluations of the scalar functions ϕ_j. In addition, whenever I is updated, we store the matrices A₁(I, I), ..., A_s(I, I) and we compute the compact QR factorization

[A₁(:, I), ..., A_s(:, I)] = Q_I [R₁, ..., R_s].

In particular, for all θ ∈ Θ we have that A(θ)(:, I) = Q_I Σ_{j=1}^s ϕ_j(θ) R_j. By denoting with R_A(θ) the Cholesky factor of A(θ)(I, I), i.e. A(θ)(I, I) = R_A(θ)ᵀ R_A(θ), we can rewrite trace(A_I(θ)) as follows:

(3.6)  trace(A_I(θ)) = trace( A(θ)(:, I) A(θ)(I, I)^{−1} A(θ)(I, :) ) = trace( A(θ)(I, :) A(θ)(:, I) R_A(θ)^{−1} R_A(θ)^{−T} ) = ‖ ( Σ_{j=1}^s ϕ_j(θ) R_j ) R_A(θ)^{−1} ‖_F²,

where we used the cyclic property of the trace and the column orthonormality of Q_I for deriving the third equality. The considerations above lead to Algorithm 3.2.
Remark 3.1. Instead of using the QR decomposition, the computation of trace(A_I(θ)) can also be carried out by precomputing the quantities B_{ij} := A_i(I, :) A_j(:, I), i, j = 1, ..., s, and exploiting the relation

trace(A_I(θ)) = trace( A(θ)(I, I)^{−1} B(θ) ),  where B(θ) := Σ_{i,j=1}^s ϕ_i(θ) ϕ_j(θ) B_{ij}.

This formula might be implemented in place of (3.6), which requires computing the QR decomposition of A(θ)(:, I). However, forming the matrix B(θ), which is the principal submatrix of A(θ)² corresponding to the index set I, causes a loss of accuracy in the evaluation of the stopping criterion. Due to the squaring of the singular values of A(θ)(:, I), the computed quantity is unreliable when trace(E_I) is below the square root of the machine precision.

[Algorithm 3.2 (parameter-dependent adaptive cross approximation); listing omitted.]

Note that the matrix Q_I in line 8 is actually not needed. More importantly, a careful implementation of Algorithm 3.2 recognizes that each A_j(:, I) grows by one column at a time. Using QR updating techniques [16], the cost of executing line 8 in each outer loop is O(ns²k), instead of O(ns²k²) when computing the QR decomposition from scratch in each loop. Similarly, updating techniques could be used to accelerate the inner loop, but these are likely less important as long as s and k are much smaller than n. Without such techniques, one inner loop costs O(sk³ + s²k²). In summary, the total cost of executing k iterations of Algorithm 3.2 is O(ns²k² + m(sk⁴ + s²k³)). Additionally, O(kns) evaluations of entries from the coefficients A_j are needed.
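The greedy scheme of this subsection can be sketched compactly as follows. This is a simplified illustration only: it omits the QR-based and trace-precomputation optimizations of Algorithm 3.2 and recomputes the residual traces directly, so its cost scaling differs from the actual algorithm; all names are illustrative.

```python
import numpy as np

def param_aca(A_terms, phis, thetas, tol, max_iter=100):
    """Simplified greedy parameter-dependent ACA.

    A(theta) = sum_j phis[j](theta) * A_terms[j] is assumed SPSD on thetas.
    Returns an index set I with trace(A(th) - A_I(th)) <= tol on thetas.
    """
    def A(th):
        return sum(p(th) * M for p, M in zip(phis, A_terms))

    def residual(th, I):
        Ath = A(th)
        if not I:
            return Ath
        return Ath - Ath[:, I] @ np.linalg.solve(Ath[np.ix_(I, I)], Ath[I, :])

    I = []
    for _ in range(max_iter):
        gaps = [np.trace(residual(th, I)) for th in thetas]
        j = int(np.argmax(gaps))
        if gaps[j] <= tol:                       # uniform certificate reached
            break
        E = residual(thetas[j], I)               # one ACA step on theta*
        I.append(int(np.argmax(np.diag(E))))
    return I

# Toy affine family A(theta) = A_1 + theta * A_2 with total rank <= 6.
rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((40, 3)), rng.standard_normal((40, 3))
A_terms = [B1 @ B1.T, B2 @ B2.T]
phis = [lambda t: 1.0, lambda t: t]
thetas = np.linspace(0.1, 1.0, 5)
I = param_aca(A_terms, phis, thetas, tol=1e-8)
```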
3.4. Robustness of ACA. As we will explain in Section 4, the affine linear expansion (3.3) is often obtained from an approximation of the true covariance operator. This approximation error may destroy positive semi-definiteness, a property that is assumed throughout the derivations above.
For indefinite matrices, the stopping criteria at line 8 of Algorithm 3.1 and at line 18 of Algorithm 3.2 lose their justification because the relation (3.2) between the trace and the Frobenius norm does not hold. On the other hand, if A is symmetric and very close to an SPSD matrix, the trace of A is still nearly an upper bound for its Frobenius norm. Establishing a similar result for the approximation error A − A_I returned by Algorithm 3.1 is more subtle because it is not clear whether A − A_I remains close to being SPSD. Inspired by the stability analysis of the pivoted Cholesky factorization [25, Section 10.3.1], the following lemma provides such a result.

Lemma 3.2. Consider symmetric matrices A, E ∈ R^{n×n} such that Ã := A + E is SPSD and ‖E‖₂ ≤ δ for some δ > 0. Let A_I be the cross approximation returned by Algorithm 3.1 applied to A with tolerance tol. If ρ := δ ‖A(I, I)^{−1}‖₂ < 1, then

λ̃_j ≤ λ_j + δ ( 1 + ‖W‖₂² / (1 − ρ) ),  j = 1, ..., n − k,

where λ_j and λ̃_j, j = 1, ..., n − k, denote the ordered eigenvalues of the Schur complements S and S̃ defined in the proof, and W := A(I, I)^{−1} A(I, I^c) with I^c := {1, ..., n} \ I.
This bound implies

trace(S̃) = Σ_{j=1}^{n−k} λ̃_j ≤ Σ_{j=1}^{n−k} λ_j + (n − k) δ ( 1 + ‖W‖₂² / (1 − ρ) ) ≤ tol + (n − k) δ ( 1 + ‖W‖₂² / (1 − ρ) ),

where the last inequality uses the stopping criterion tol ≥ trace(A − A_I) = trace(S) of Algorithm 3.1.
The upper bound of Lemma 3.2 indicates that the impact of the inexactness δ on the robustness of the trace criterion (3.2) may be magnified by the quantity ‖W‖₂². Typically, ‖W‖₂² remains moderate, but it is difficult to provide a rigorous meaningful bound. The quite pessimistic bound ‖W‖_F² ≤ (1/3)(n − k)(4^k − 1) applies to some particular cases; see [25, Lemma 10.13].

3.5. ACA sampling and its complexity.
Let us now consider a parameterized Gaussian measure M, as introduced in Section 1, with linearly separable covariance operator C : Θ → SPSD(n). In the following, we explain how the ACA approximation of C enables us to sample quickly from an approximation of M(·|θ) for a large number of θ ∈ Θ.
After applying param ACA (Algorithm 3.2) to C, we obtain the approximate parameterized matrix C_I : Θ → SPSD(n) defined in (3.4). The corresponding approximate Markov kernel M̃ is given by M̃(·|θ) := N(0, C_I(θ)). After this offline phase, we proceed with the online phase, that is, the sampling from M̃. For a given parameter θ ∈ Θ, we compute the Cholesky factor L ∈ R^{k×k} of C(θ)(I, I). For a sample ξ ∼ N(0, Id_k) we have b := L^{−T} ξ ∼ N(0, C(θ)(I, I)^{−1}). Finally, the sample is obtained by the matrix-vector product C(θ)(:, I) b ∼ N(0, C_I(θ)), which costs another O(nk). In summary, the computational cost of one sample in the online phase is given by O(kns + k³). Hence, the cost of the complete procedure for drawing u ∈ N samples is given by

O( ns²k² + m(sk⁴ + s²k³) + u(kns + k³) ).

Hence, the cost is linear in the size of the matrix n, the number of test parameters m, and the number u of different samples. As our approach is most meaningful in situations where the rank k is much smaller than n and s is small, the terms k²ns² and knsu will typically dominate the offline and online phases, respectively. Compared to applying the non-parameterized ACA u times, which costs O(k²nu), param ACA can be expected to be cheaper as long as s remains small relative to k and m is not too large.
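The online step can be sketched as follows (a minimal illustration of sampling from N(0, C_I(θ)) given the selected columns and the pivot block; names are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_from_cross(A_cols, A_II, rng):
    """One sample from N(0, A(:,I) A(I,I)^{-1} A(I,:)) at O(nk + k^3) cost.

    A_cols = A(theta)(:, I) is n x k, A_II = A(theta)(I, I) is k x k SPD.
    """
    L = np.linalg.cholesky(A_II)                          # A_II = L L^T
    xi = rng.standard_normal(A_II.shape[0])
    b = solve_triangular(L, xi, trans='T', lower=True)    # b = L^{-T} xi
    return A_cols @ b    # covariance A_cols A_II^{-1} A_cols^T, as desired

# Toy low-rank SPSD "covariance" with index set I = {0, 1, 2}.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 3))
A = B @ B.T
I = [0, 1, 2]
x = sample_from_cross(A[:, I], A[np.ix_(I, I)], rng)
```

Indeed, b has covariance L^{−T} L^{−1} = A(I, I)^{−1}, so A(:, I) b has exactly the cross-approximation covariance.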
4. Isotropic covariance kernels. Let D ⊆ R^ℓ be some domain and ‖·‖_D be a norm on D. A covariance kernel is a function c : D × D → R that is continuous, symmetric, and positive semidefinite. In the following, we consider isotropic covariance kernels c, which take the form c(x, y) = c̃(‖x − y‖_D) for all x, y ∈ D and some function c̃ : [0, ∞) → R. Such a kernel is invariant with respect to isometries, e.g., translations and rotations in the case of the Euclidean norm. Given points x₁, ..., x_n ∈ D, the covariance matrix C ∈ R^{n×n} associated with c is defined by

(4.1)  C_{ij} := c(x_i, x_j),  i, j = 1, ..., n.

Parameterized covariance kernels are functions c : D × D × Θ → R such that c(·, ·; θ) is a covariance kernel for any θ ∈ Θ. The associated parameterized covariance matrix C : Θ → R^{n×n} is defined analogously as in (4.1).
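Assembling (4.1) for an isotropic kernel takes a few lines; the following sketch (illustrative names, Euclidean norm, and a Gaussian kernel as a stand-in) evaluates c̃ on the pairwise distance matrix:

```python
import numpy as np

def covariance_matrix(points, kernel_1d):
    """Assemble C_ij = c_tilde(||x_i - x_j||_2) as in (4.1)."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise differences
    D = np.linalg.norm(diff, axis=-1)                # distance matrix
    return kernel_1d(D)                              # elementwise kernel

# Gaussian kernel with correlation length 0.5 on 5 random points in [0,1]^2.
rng = np.random.default_rng(0)
X = rng.random((5, 2))
C = covariance_matrix(X, lambda d: np.exp(-(d / 0.5) ** 2))
```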

4.1. Examples.
The following examples of parameterized covariance kernels will be considered throughout the rest of this work. We refer to the work by Stein [38] for further examples.
The correlation length has the same interpretation as for Gaussian kernels. Moreover, a Gaussian kernel with correlation length θ₁ can be viewed as a special case of a Matérn kernel by taking the limit θ₂ → ∞. For finite values, the parameter θ₂ determines the smoothness of the random samples; see, e.g., [39, Proposition 3.2] for details.
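For concreteness, the Matérn kernel can be evaluated as in the following sketch. It uses the standard parameterization with correlation length and smoothness parameter; whether the normalization matches the paper's definition (4.3) exactly is an assumption, since that display is not reproduced here.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, length, nu):
    """Matern kernel at distances d (array), standard parameterization:
    c(d) = 2^{1-nu}/Gamma(nu) * z^nu * K_nu(z),  z = sqrt(2 nu) d / length,
    with the limit value 1 at d = 0."""
    d = np.asarray(d, dtype=float)
    out = np.ones_like(d)                    # c(0) = 1
    m = d > 0
    z = np.sqrt(2.0 * nu) * d[m] / length
    out[m] = (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return out

# Sanity check: nu = 1/2 recovers the exponential kernel exp(-d / length).
dists = np.array([0.0, 0.3, 1.0])
vals = matern(dists, length=1.0, nu=0.5)
```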
We aim at applying Algorithm 3.2 to obtain low-rank approximations of the covariance matrix C(θ) associated with any of the parameterized covariance kernels above. However, the linear separability assumption (3.3) is not satisfied because the kernels depend nonlinearly on θ. To address this issue, an approximate linearization of Matérn kernels using Taylor expansion has been proposed in [28]. For smaller correlation lengths, this method becomes inefficient, and we therefore discuss more general and more effective methods for linearizing isotropic covariance kernels in the following. We seek a linearly separable expansion

(4.4)  c(x, y; θ) ≈ c_s(x, y; θ) := Σ_{j=1}^s ϕ_j(θ) a_j(‖x − y‖₂).

By defining A_j to be the matrix associated with a_j(‖x − y‖₂), as in (4.1), it follows that the matrix

(4.5)  C_s(θ) := Σ_{j=1}^s ϕ_j(θ) A_j

is associated with c_s(x, y; θ). Note that, unless (4.5) is an exact expansion, the matrices C_s(θ), while still being symmetric, are not guaranteed to be positive semi-definite for every θ ∈ Θ. Lemma 3.2 indicates that ACA remains robust despite the loss of semi-definiteness and, assuming that the tolerance of ACA is on the level of the approximation error of (4.4), the total error committed by ACA and the expansion can be expected to stay on the same level. The separable expansion (4.4) is the continuous analogue of low-rank matrix approximation. Several methods have been proposed to address this task; see, e.g., [4, 40, 32]. In the following sections we describe how we compute such an expansion depending on the number of parameters.
Intuitively, the product F · M with M ∈ R^{r×p} is defined as the quasimatrix with p columns, each of which is a linear combination of the columns of F. Moreover, the most common matrix factorizations, e.g. LU, QR and SVD, can be extended to quasimatrices [42]. Analogously, we define an r-tuple of real functions of θ as a transposed quasimatrix, i.e. an r × Θ matrix, and the corresponding matrix operations. The algorithm that we employ for computing the affine linear expansion (3.3) combines EIM with the SVD of quasimatrices. It proceeds as follows:
1. Generate samples θ₁, ..., θ_r from Θ.
2. Define the snapshot functions ĉ_j(d) := c̃(d, θ_j), j = 1, ..., r, and form the quasimatrix F := [ĉ₁(d)| ... |ĉ_r(d)].
3. Compute the SVD of F: F = UΣVᵀ. Truncate the SVD by keeping the first s columns u₁(d), ..., u_s(d) of U, corresponding to the diagonal entries of Σ that are above a prescribed tolerance τ.
4. Determine a set I = {d₁, ..., d_s} of nodes for interpolation by using the greedy strategy of EIM, reported in Algorithm 4.1.
5. Set a_j(x, y) := u_j(‖x − y‖₂), j = 1, ..., s, and define the functions ϕ_j(θ) via the transposed quasimatrix [ϕ₁(θ), ..., ϕ_s(θ)]ᵀ that solves the linear system given by the interpolation conditions

Σ_{j=1}^s ϕ_j(θ) u_j(d_i) = c̃(d_i, θ),  i = 1, ..., s.

The whole procedure is reported in Algorithm 4.2. The practical implementation again makes use of the toolbox Chebfun2, which provides all the functionalities for handling quasimatrices.
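The steps above can be sketched in a fully discrete analogue, sampling d on a fine grid instead of working with quasimatrices (all names are illustrative; the node-selection loop mimics the greedy strategy of EIM):

```python
import numpy as np

def eim_nodes(U):
    """Greedy EIM node selection on the columns of U."""
    nodes = [int(np.argmax(np.abs(U[:, 0])))]
    for j in range(1, U.shape[1]):
        # interpolate column j with the previous columns at the chosen nodes
        coef = np.linalg.solve(U[nodes, :j], U[nodes, j])
        resid = U[:, j] - U[:, :j] @ coef
        nodes.append(int(np.argmax(np.abs(resid))))   # largest residual
    return nodes

# Discrete separable expansion of a Gaussian kernel family c(d; theta).
d = np.linspace(0.0, 2.0, 400)
thetas = np.linspace(0.5, 1.5, 50)
F = np.exp(-(d[:, None] / thetas[None, :]) ** 2)      # snapshot matrix
U, S, _ = np.linalg.svd(F, full_matrices=False)
s = int(np.sum(S > 1e-8 * S[0]))                      # truncation rank
Us = U[:, :s]
I = eim_nodes(Us)
# Coefficients phi_j(theta) from the interpolation conditions at the nodes:
theta = 0.77
phi = np.linalg.solve(Us[I, :], np.exp(-(d[I] / theta) ** 2))
err = np.max(np.abs(Us @ phi - np.exp(-(d / theta) ** 2)))
```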
The sampling at step 1 can be performed via a greedy algorithm, as in [2], or with other strategies that aim at mitigating the curse of dimensionality when considering a high-dimensional parameter space [5, 23, 30]. Since in our numerical tests we do not need to deal with very high-dimensional parameter spaces, we just perform a tensorized equispaced sampling of Θ. The accuracy of Algorithm 4.2 is tested in the following example.
[Algorithms 4.1 (node selection) and 4.2 (computation of the separable expansion); listings omitted.]

We discretize the random field on a regular grid with n := n₀² grid points. In turn, the matrix C(θ) ∈ R^{n×n} is given by evaluating the kernel at grid points x_i whose two coordinates are determined by div(i, n₀) and mod(i, n₀), where div(i, n₀) represents the integer division of i by n₀ and mod(i, n₀) represents the associated remainder term. We illustrate the node positions with associated indices in Figure 5. Note that this covariance matrix corresponds to a finite element discretization of the random field using piecewise constant ansatz functions with square supports and quadrature with one node in the centre of each of the supports.
Remark 5.1 (BTTB). Note that the matrix C(θ) is Block Toeplitz with Toeplitz Blocks (BTTB). This is caused by the regular distribution of the nodes and the isotropy of the covariance kernel. For covariance matrices with this structure, efficient low-rank approximation [18] and sampling strategies (circulant embedding) are available. Note that our ACA method neither requires nor profits from this structure. Thus, the results discussed in the following extend to covariance matrices that are not BTTB, which arise, for instance, when considering more complex geometries D or irregular discretizations.
Test 1. As a first test, we set n₀ = 512, i.e. n ≈ 2.6·10⁵ degrees of freedom, and the cardinality of Θ_f to 1000. Then, we run Algorithm 3.2 using tol = 10⁻¹ as value for the tolerance at line 8 of Algorithm 3.1. This tolerance guarantees that a KL expansion with the same Wasserstein(2) error would represent 90% of the random field's variance; this is a usual choice in random field discretizations (see Remark 2.5).
The evolution of the trace error as the algorithm proceeds is shown in Figure 6. The algorithm terminates after 65 iterations, meaning that a 65-dimensional basis is sufficient to represent the parameter-dependent 512² × 512² operator with the desired accuracy. Note that Algorithm 3.2 always picks the smallest correlation length θ* = 0.1 in the for-loop beginning in line 10. The computational time of the algorithm is 336.8 seconds. We remark that the updating strategy for the QR factorization at line 8 of Algorithm 3.2, described in Subsection 3.3, is crucial. Indeed, computing the QR factorization from scratch in each iteration increases the execution time of Algorithm 3.2 to 1593.3 seconds, 80% of which is spent on line 8.
Test 2. In order to shed some light on the role played by θ, Figure 7 shows the sorted eigenvalues of C(θ) for θ ∈ [0.1, √2] when using n₀ = 128 for the discretization. We conclude that a smaller correlation length leads to a slower eigenvalue decay. Consequently, a small value of θ implies that a larger rank is needed in a low-rank approximation of C(θ) in order to retain the same accuracy.

Test 3. Next, we study how the outcome of Algorithm 3.2 depends on the discretization; timings and ranks for different values of n are reported in Figure 8 (right). We see that the rank converges to 65 when refining the random field, suggesting that the rank obtained with ACA might be discretization invariant. Also, the rank seems not to be influenced at all by the cardinality of Θ_f, since the algorithm always chooses the smallest correlation length. In this particular setting, the outcome of the algorithm does not change as long as the smallest correlation length in Θ_f does not vary.
Test 4. Now, we compare the convergence history of the method with the nuclear norm of the error and the trace error of the best low-rank approximations of C(θ). We remark that the convergence history of Algorithm 3.2 corresponds to the trace error with respect to the approximate covariance kernel, i.e. the quantity max_{θ∈Θ_f} trace(C_s(θ) − C_{s,I}(θ)). Since the approximate covariance kernel is not guaranteed to be SPSD, as true measure of the error we consider the maximum nuclear norm of C(θ) − C_{s,I}(θ) over θ ∈ Θ_f. Finally, since C_{s,I}(θ) is a rank-k matrix, k = |I|, for each value of θ, we also compute the benchmark quantity max_{θ∈Θ_f} Σ_{j=k+1}^n λ_j(C(θ)), i.e. the trace error of the best rank-k approximation of C(θ).
We run Algorithm 3.2 on an example of small dimension: n₀ = 20 and |Θ_f| = 200. Then, we plot the three quantities discussed above in Figure 9. We see that the trace error computed by Algorithm 3.2 and the nuclear norm of the error coincide as long as they stay above 10⁻⁸. Then, the nuclear norm of the error stagnates because we have reached the level of inexactness which affects the approximate linearly separable expansion. To conclude, we observe that the trace error associated with the truncated SVDs decays only slightly faster, confirming the good convergence rate of the approximation returned by Algorithm 3.2.
Test 5. We conclude the experiments on the Gaussian kernel by testing the sampling strategy proposed in Subsection 3.5 for n₀ = 512 and |Θ_f| = 100. More specifically, we consider the task of sampling u values of θ from a uniform distribution on [0.1, √2] and, for each of those values, generating one sample from N(0, C(θ)). Our strategy is to run Algorithm 3.2 to get the approximated kernel operator C_I(·) and then sample from N(0, C_I(θ)) as described in Subsection 3.5. We compare this method with applying Algorithm 3.1 directly on C(θ) for each sample of θ. The timings of the two approaches, as the number of samples increases, are reported in Figure 10. Notice that the time consumption of the strategy based on Algorithm 3.2 starts from a positive value, which indicates the cost of the offline phase. We see that, while both methods have linear cost with respect to u, the online phase of our strategy is faster by a factor of about 4. Increasing the number of samples u thus helps to amortize the cost of the offline phase and to make our algorithm more convenient.

5.2. Parameterized Matérn covariance kernel with slower eigenvalue decay. In this section, we modify the parameterized random field considered in Subsection 5.1 by replacing the Gaussian covariance kernel with the Matérn covariance kernel defined in (4.3), with θ₁ ∈ [0.1, √2] and θ₂ = 2.5. Also here, we make use of an approximate linearized kernel c_s, computed via the function-valued ACA, with cut-off value s = 18, which again provides an approximation error of about 10⁻⁸. This ensures that the approximation error of c_s does not affect the convergence history of Algorithm 3.2 in the considered example.
The samples with this covariance kernel are only twice continuously differentiable, as opposed to the analytic samples from the Gaussian covariance kernel. Hence, we expect a larger numerical rank of the associated covariance matrix C(θ) and a larger rank when employing the ACA algorithm. We run Algorithm 3.2 with n = n₀² = 512², |Θ_f| = 1000 and tol = 10⁻¹. The evolution over the iterations of the trace error computed by Algorithm 3.2 is reported in Figure 11. The algorithm terminates after 106 iterations, and it takes 1015.9 seconds, averaged over a total of three runs. It thus takes about 3 times longer than the similar experiment with the Gaussian kernel. As expected, the rank that we obtain is indeed larger than before. On the other hand, a 106-dimensional basis is still suitably small to efficiently represent a parameter-dependent matrix of size 512² × 512².

6. Conclusions. We have proposed and analyzed a new method for the low-rank approximation of the Cholesky factor of a parameter-dependent covariance operator. The approximation returned by our algorithm is certified, in the sense that it guarantees an upper bound for the error in the Wasserstein distance W₂ between mean-zero Gaussian measures having the true and the low-rank covariance matrix, respectively. Algorithm 3.2 leads naturally to a fast sampling procedure for parameterized Gaussian random fields and potentially to the efficient treatment of certain Bayesian inverse problems [10, 28]. Beyond Gaussian random fields, the parameter-dependent ACA method can be applied in much more general hierarchical, kernel-based learning tasks. Consider, e.g., a non-linear support vector machine with parameterized (i.e. shallow), isotropic, symmetric, and positive semidefinite kernels that shall be trained with a large amount of data. Here, ACA leads to a natural compression of the kernel matrix. Analyzing the error introduced by ACA in other learning tasks would be an interesting topic for future work.
There is further potential for future work. For example, while we have observed experimentally, for the examples tested, that Algorithm 3.2 returns an index set I of low cardinality that yields a good approximation A_I(θ) for all parameter values, it would be interesting to derive a priori existence results based on properties of the kernel.

δ(1 + ‖W‖₂²/(1 − ρ)),
where W := A(I, I)⁻¹ A(I, Iᶜ) and Iᶜ := {1, . . ., n} \ I.

Proof. There exists E with the desired properties if and only if the smallest eigenvalue of A is not smaller than −δ, and we may therefore assume, without loss of generality, that E = δ·Id. Consider the Schur complements

S = A(Iᶜ, Iᶜ) − A(Iᶜ, I) A(I, I)⁻¹ A(I, Iᶜ),    S̃ = Ã(Iᶜ, Iᶜ) − Ã(Iᶜ, I) Ã(I, I)⁻¹ Ã(I, Iᶜ).

Because ρ < 1, the inverse of Ã(I, I) exists and the Neumann series gives

S̃ = A(Iᶜ, Iᶜ) + δ·Id − A(Iᶜ, I) Σ_{j≥0} [−δ A(I, I)⁻¹]^j W = S + δ·Id + δ Wᵀ Σ_{j≥0} [−δ A(I, I)⁻¹]^j W,

see also [25, Lemma 10.10]. In turn, ‖S̃ − S‖₂ ≤ δ(1 + ‖W‖₂²/(1 − ρ)). Because S and S̃ are symmetric matrices, standard eigenvalue perturbation results [16, Thm 8.1.5] imply that
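The bound ‖S̃ − S‖₂ ≤ δ(1 + ‖W‖₂²/(1 − ρ)) can be sanity-checked numerically. The Python/NumPy sketch below uses a randomly generated SPD matrix, an arbitrary index split, and the worst-case perturbation E = δ·Id; all sizes and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, delta = 60, 10, 1e-3

# Random symmetric positive definite matrix and an index split I, I^c
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)
I, Ic = np.arange(k), np.arange(k, n)

rho = delta * np.linalg.norm(np.linalg.inv(A[np.ix_(I, I)]), 2)
assert rho < 1  # required for the Neumann series argument

# Schur complements of A and of the perturbed matrix A~ = A + delta * Id
W = np.linalg.solve(A[np.ix_(I, I)], A[np.ix_(I, Ic)])
S = A[np.ix_(Ic, Ic)] - A[np.ix_(Ic, I)] @ W
At = A + delta * np.eye(n)
Wt = np.linalg.solve(At[np.ix_(I, I)], At[np.ix_(I, Ic)])
St = At[np.ix_(Ic, Ic)] - At[np.ix_(Ic, I)] @ Wt

gap = np.linalg.norm(St - S, 2)
bound = delta * (1 + np.linalg.norm(W, 2) ** 2 / (1 - rho))
print(gap <= bound)  # the perturbation bound holds
```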

4.2.1. 1D parameter space. When there is only one parameter θ and Θ is an interval, then c(d, θ) becomes a bivariate function on a rectangular domain. In this situation we

The name quasimatrix comes from the representation of a generic tuple (f_1(d), . . ., f_r(d)) as a [d̲, d̄] × r matrix F := [f_1(d) | . . . | f_r(d)], where the continuous row index refers to the argument d while the discrete column index selects the function. Common operations for matrices can be easily extended to quasimatrices. For instance, given a finite set I = {d_1, . . ., d_s} ⊂ [d̲, d̄], we define F(I, :) := [f_j(d_i)]_{i=1,…,s; j=1,…,r}, i.e., the s × r matrix whose i-th row contains the values f_1(d_i), . . ., f_r(d_i).
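A quasimatrix and the slicing operation F(I, :) can be mimicked in code by storing the r columns as functions; the following Python sketch (the function choices and the set I are arbitrary) forms the finite slice as an ordinary s × r matrix.

```python
import numpy as np

# A quasimatrix: columns are functions of a continuous argument d in [0, 1],
# so the "row index" d is continuous while the column index j is discrete.
fs = [lambda d, k=k: np.cos(k * np.pi * d) for k in range(3)]  # r = 3 columns

def F(I):
    """Slice F(I, :): evaluate all columns on the finite set I = {d_1, ..., d_s}."""
    I = np.asarray(I, dtype=float)
    return np.column_stack([f(I) for f in fs])  # s x r matrix with entries f_j(d_i)

M = F([0.0, 0.5, 1.0])  # a plain 3 x 3 matrix
print(M.shape)
```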

Figure 8: Test 3. Timings in seconds (left) and ranks (right) obtained from the ACA algorithm applied to different discretizations of the covariance operator. The x-axes of both plots show the degrees of freedom n = n₀² of the discretized covariance operators. The y-axis of the left plot shows the mean elapsed time over three runs; the y-axis of the right plot shows the ranks obtained for the considered |Θ_f|.
-th vector of the canonical basis
10: end for
11: return a_1, . . ., a_s, φ_1, . . ., φ_s

5. Numerical experiments. We test the performance of Algorithm 3.2 on the examples discussed in the previous section. All experiments have been carried out on a laptop with a dual-core Intel Core i7-7500U 2.70 GHz CPU, 256 KB of level-2 cache, and 16 GB of RAM. The algorithms are implemented in Matlab and tested under MATLAB 2019a, with MKL BLAS version 2018.0.3 utilizing both cores.
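As a rough illustration of the kind of low-rank approximation exercised in these experiments, the following Python/NumPy sketch (not the paper's Matlab code; the kernel, grid, and tolerance are made up) builds a pivoted Cholesky factorization of a fixed SPD kernel matrix, i.e., the ACA variant for symmetric positive definite matrices, with the trace of the residual as the error monitor, and then draws one sample.

```python
import numpy as np

def pivoted_cholesky(get_column, diag, tol=1e-8, max_rank=100):
    """Low-rank factor L with C ≈ L @ L.T, chosen by diagonal pivoting.

    get_column(j) returns column j of the SPD matrix C and diag holds a
    copy of its diagonal; the trace of the residual bounds the 2-norm
    error and is used as the stopping criterion.
    """
    d = diag.copy()
    L = np.zeros((d.size, 0))
    while d.sum() > tol and L.shape[1] < max_rank:
        j = int(np.argmax(d))                    # pivot: largest residual diagonal
        c = get_column(j) - L @ L[j, :]          # j-th column of the residual
        L = np.column_stack([L, c / np.sqrt(c[j])])
        d -= L[:, -1] ** 2                       # update residual diagonal
    return L

# Toy setup: Gaussian kernel on a uniform grid of [0, 1]
x = np.linspace(0.0, 1.0, 200)
C = np.exp(-((x[:, None] - x[None, :]) / 0.2) ** 2)
L = pivoted_cholesky(lambda j: C[:, j].copy(), np.diag(C).copy())

# A sample with covariance L @ L.T ≈ C costs only O(n * rank) to draw
rng = np.random.default_rng(0)
u = L @ rng.standard_normal(L.shape[1])
print("rank:", L.shape[1])
```

Since the residual of a pivoted Cholesky step stays positive semidefinite, its trace dominates its 2-norm, which is what makes the simple stopping criterion above a certified error bound.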