Estimating linear response statistics using orthogonal polynomials: An RKHS formulation

We study the problem of estimating linear response statistics under external perturbations using time series of the unperturbed dynamics. Based on the fluctuation-dissipation theory, this problem is reformulated as an unsupervised learning task of estimating a density function. We consider a nonparametric density estimator formulated by the kernel embedding of distributions with "Mercer-type" kernels, constructed based on the classical orthogonal polynomials defined on non-compact domains. While the resulting representation is analogous to the Polynomial Chaos Expansion (PCE), the connection to the reproducing kernel Hilbert space (RKHS) theory allows one to establish the uniform convergence of the estimator and to systematically address the practical question of identifying the PCE basis for a consistent estimation. We also provide practical conditions for the well-posedness not only of the estimator but also of the underlying response statistics. Finally, we provide a statistical error bound for the density estimation that accounts for the Monte-Carlo averaging over non-i.i.d. time series and the biases due to a finite basis truncation. This error bound provides a means to understand the feasibility as well as the limitations of the kernel embedding with Mercer-type kernels. Numerically, we verify the effectiveness of the estimator on two stochastic dynamics with known, yet non-trivial, equilibrium densities.


1. Introduction. Estimating linear response statistics of dynamical systems under external forces is a problem of broad interest. This forward Uncertainty Quantification (UQ) problem has many important applications. For example, in climate dynamics, the linear response can be used as a proxy that quantifies the climate change statistics corresponding to exogenous forcing such as volcanic eruptions or anthropogenic factors such as human activities [28,33]. In statistical mechanics, the linear response provides a simple route to determining transport coefficients from microscopic fluctuations [26], such as viscosity, diffusion coefficients, and heat conductivity, via Green-Kubo-type formulas [17,25]. One of the popular approaches to quantify the linear response statistics is the Fluctuation-Dissipation theory (FDT), which, in rough terms, states that the leading-order (linear) statistical response to small perturbations can be approximated by a two-point statistic of the unperturbed dynamics. In this paper, we consider computing the linear response in the context of ergodic stochastic differential equations, the validity of which was studied in [18].
In practice, while the relevant two-point statistics can be numerically estimated by a Monte-Carlo average over samples of the unperturbed dynamics at the equilibrium state, the integrand (the function to be averaged) depends on the explicit expression of the equilibrium density of the unperturbed dynamical system, which is unknown in general. Examples include PDEs that exhibit spatio-temporal chaos [24], non-equilibrium steady states in statistical mechanics [13,4], and coarse-grained molecular dynamics [41], where the potential of mean force has to be estimated from data.
The main problem that arises in such applications is the density estimation of the unknown equilibrium distribution of the unperturbed dynamics, which in turn allows us to compute the linear response statistics via the FDT theory. Density estimation is a long-standing problem in computational statistics and machine learning. There are many approaches developed to tackle this fundamental problem, and one can classify them based on the form of the models (parametric, non-parametric, or semi-parametric) or based on the training methodology (linear or nonlinear estimators). Parametric or semi-parametric models are suitable for problems where some physical knowledge is known. In the absence of physical knowledge, a popular parametric density estimator is the Gaussian Mixture Model (also known as the Radial Basis Model in some literature) [21]. Regardless of whether physical knowledge is available, most of these are nonlinear estimators, since their training phase involves a minimization procedure to estimate the latent parameters. Two practical issues of such nonlinear estimators are: (1) the difficulty of finding the global minimizer using numerical methods that provably converge only to local minima, especially when the parameter space is high-dimensional; (2) the identifiability of these parameters when the form of the underlying density function is not known. While these issues remain unresolved, recent successes of Deep Neural Networks (DNNs) suggest that high-dimensional functions can be approximated effectively using compositions of ReLU or sigmoid-type functions, even when the global minimizer is not found [7]. In this direction, several recently introduced density estimators have adopted DNNs, such as the Neural Autoregressive Distribution Estimation [52] and its variant, the Masked Autoregressive Flow [38].
We should point out that with this type of estimator, the complexity of each function evaluation is on the order of M, where M is the number of parameters in the neural network model [52,53]. For our application, as we shall see, since we need to evaluate the derivatives of the density at each point of a training data set of size at least N = O(10^7), a small M is desirable. While nonlinear estimators are a promising direction that warrants a thorough investigation, we will not pursue them in the current paper.
In contrast to these approaches, the nonparametric approach requires the fewest modeling assumptions, and our goal is to construct such an estimator that is consistent in the limit of large data. Among the existing approaches, it is widely accepted that the classical Kernel Density Estimation (KDE) [40] is not effective for problems with dimension higher than three (see, e.g., [21,31,53]). In addition to the bandwidth specification issue, the KDE implemented with radial-type kernels is not practical for our application. In particular, given a training data set {X_i}_{i=1}^N of size N, which can be of order 10^7 or larger depending on applications, the estimator is defined as an average of the radial functions {k(x, X_i) = h(‖x − X_i‖)}, where ‖·‖ denotes, e.g., the d-dimensional Euclidean norm and h is a radial profile function. With such an estimator, computing the linear response statistics requires evaluating the norms of the distances between pairs of training data points roughly N²/2 times, in addition to evaluating h and its derivatives on each of these N²/2 distances. Alternatively, a theoretically consistent nonparametric approach that also provably avoids the curse of dimensionality is the Bayesian Sequential Partitioning (BSP) [31,30], which involves a Sequential Importance Sampling algorithm. While the dimensionality aspect of BSP is appealing, the fact that the estimator is in the form of piecewise constant functions supported on binary partitions makes it unsuitable for our application, which requires the functional form of the derivatives of the density. While nonparametric methods that directly estimate the derivative of the log density exist (e.g., [44]), such techniques would restrict our problem to exponential-type densities. In addition, the method proposed in [44] requires an inversion of a matrix of size Nd × Nd, which is not feasible for problems with large N.
In this paper, we consider the kernel embedding of distributions [36], which is a linear density estimator. We should point out that the concept of kernel embedding of distributions was introduced in [36] to characterize probability distributions with the kernel mean embedding, which is nothing but a statistical quantity of the relevant feature maps corresponding to a reproducing kernel Hilbert space (RKHS). Here, we use the kernel embedding to estimate probability density functions. In particular, the estimator will be represented as a linear superposition of the RKHS basis functions with the kernel mean embedding as the expansion coefficients. For this application, the direct use of the kernel embedding estimator with radial-type kernels is not practical. Beyond the same computational issue that hampers the KDE method, the kernel embedding implemented with radial-type kernels requires an inversion of an N × N matrix. To avoid the complexity of the function evaluations and the large matrix inversion, we consider the "Mercer-type" kernels constructed using the classical orthogonal polynomials of a weighted L²-space. Practically, the orthogonality replaces the large matrix inversion problem with a transpose operation. Such a representation allows us to conveniently construct a hypothesis model with M expansion coefficients, with M ≪ N, such that the calculation of linear response statistics amounts to evaluating M polynomial basis functions (instead of N/2 radial basis functions) on N training data points.
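To make this concrete, here is a minimal Python sketch of such a linear estimator: the expansion coefficients are Monte-Carlo averages of orthonormal basis functions evaluated on the data, and no linear system has to be solved. The choice of Hermite functions, M = 10, and the Gaussian test data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hermite_functions(x, M):
    """Evaluate the first M orthonormal Hermite functions
    psi_n(x) = H_n(x) exp(-x^2/2) / sqrt(2^n n! sqrt(pi))
    via a stable three-term recurrence.  Returns an (M, len(x)) array."""
    x = np.asarray(x, dtype=float)
    psi = np.empty((M, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-0.5 * x ** 2)
    if M > 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for n in range(1, M - 1):
        psi[n + 1] = (np.sqrt(2.0 / (n + 1)) * x * psi[n]
                      - np.sqrt(n / (n + 1.0)) * psi[n - 1])
    return psi

def embed_density(samples, M):
    """Embedding coefficients c_n = E[psi_n(X)], estimated by a plain
    Monte-Carlo average over the samples (a transpose-like operation)."""
    return hermite_functions(samples, M).mean(axis=1)

def density_estimate(x, coeffs):
    """Linear estimator p_hat(x) = sum_n c_n psi_n(x)."""
    return coeffs @ hermite_functions(x, coeffs.size)

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)     # stand-in for an equilibrium time series
c = embed_density(X, M=10)       # M = 10 coefficients, M << N
x = np.linspace(-3.0, 3.0, 7)
p_true = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
print(np.max(np.abs(density_estimate(x, c) - p_true)))  # small for N(0,1) data
```

Note the design point made in the text: the cost of forming all M coefficients is one pass over the N samples, with no N × N matrix ever assembled.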
We should point out that the resulting estimator (kernel embedding with a polynomial basis) is essentially the polynomial chaos expansion (when classical orthogonal polynomials are used, e.g., Hermite polynomials), whose convergence is often understood in the L²-sense, relying on the Cameron-Martin theorem. See, e.g., [56,12] for specific conditions for the convergence. In this paper, we will show that the uniform convergence of the polynomial chaos expansion can be naturally achieved using the theory of RKHS. To achieve this goal, we develop an RKHS using the "Mercer-type" kernels induced by orthogonal polynomials, which then allows its components to inherit the "nice" properties of the kernel, such as boundedness and smoothness. As opposed to the compact domain setting for which the Mercer theorem is valid, these desirable properties, specifically boundedness and smoothness, are not automatically inherited from the kernel definition alone when the domain is not a compact metric space, which leads us to call the constructed kernel "Mercer-type". One of the practical issues in polynomial chaos expansion is how to choose the appropriate basis when the underlying distribution differs from the basic distributions listed in [56]. This issue can be naturally understood in the RKHS setting as the problem of determining how "large" the resulting RKHS needs to be to guarantee a consistent estimator. To understand this, we generalize the notion of c_0-universality of kernels, introduced in [45,46], to the space of continuous functions on R^d with an appropriate decay rate. Our study provides a systematic way to choose the basis (or, equivalently, its corresponding weight function) for approximating density functions with a decay rate that is Gaussian or faster.
While the kernel embedding density estimator is consistent in the limit of large M and N, a finite truncation of these parameters poses an undesirable issue, namely, the estimator may admit negative values when estimating positive functions defined on non-compact domains. In principle, we have traded the positivity of the estimator (which is guaranteed when radial-type kernels are used) for the computational efficiency afforded by orthogonality. In this study, we will provide a detailed characterization of a compact subset on which the density estimate admits only positive values and, most importantly, on which the linear response statistics can be estimated by a well-defined estimator restricted to this subset up to any desired precision. Practically, the well-posed estimator on this restricted domain can be numerically realized with a straightforward parameter selection criterion. We will use the designed criterion to identify M such that the linear response estimate based on the training data in the restricted domain is an accurate estimator.
Finally, we also provide a statistical error bound for the density estimation that accounts for the Monte-Carlo discretization, averaging over non-i.i.d. time series with the α-mixing property [11,19]. This error bound provides a means to understand the feasibility as well as the limitations of the kernel embedding with Mercer-type kernels in general problems. It serves as a definitive confirmation of the limitations of the polynomial chaos expansion that were reported in many numerical studies, such as [6] in the context of forward UQ and [32] in the context of Bayesian inference. It is, in fact, well known in statistics that the optimal rate of any linear estimator is of order N^{−2ℓ/(2ℓ+d)} [49], where the parameter ℓ denotes the smoothness of the function. This means that only when the function is very smooth, such as ℓ = d, does the error bound become independent of dimension, N^{−2/3}. Indeed, for data that lie on a smooth d-dimensional manifold embedded in R^n, one can construct a Mercer-type kernel based on the orthogonal basis functions obtained via the diffusion maps algorithm [9] and achieve the optimal rate reported in [49]. While this approach has been explored in [23], the errors in the estimation of these eigenbasis functions do not escape the curse of dimensionality [16]. Despite this fundamental limitation, our study sheds light on what one can achieve from implementing a linear estimator in approximating linear response statistics.

Outline and Main Contributions. We close this introduction with an outline of the rest of the paper, summarizing the main contributions of each section. Main results: • In Section 2, we provide a quick review of the FDT linear response theory and relevant results on kernels and RKHS. To have a well-defined estimation problem, we deduce sufficient conditions to guarantee that the underlying FDT linear response operator is bounded uniformly in time (see Lemma 2.1).
The main novel contribution in this section is the generalization of the c_0-universality of kernels to weighted c_0-universality, which will be used to characterize the denseness of a given RKHS in the space of continuous functions with a certain decay rate. We derive the connection between c_0-universal kernels and weighted c_0-universal kernels (see Lemma 2.8) so that the principles of c_0-universality can be transferred to the weighted scenario. • In Section 3, we discuss a framework for constructing RKHSs from classical orthogonal polynomials of a weighted L²-space. The main contribution is summarized in Proposition 1. In Section 3.1.1, we study the RKHS constructed using the Hermite polynomials. In this case, the resulting kernel is the well-known Mehler kernel [35]. We specify the regularity of the resulting RKHS in Corollary 1. In Section 3.1.2, we also study the RKHS constructed using the Laguerre polynomials. We show the boundedness of the resulting kernel, known as the Hille-Hardy kernel, in Lemma 3.4. In Section 3.2, we show that the RKHS associated with the Mehler kernel is "rich enough" to approximate any continuous density function with a Gaussian (or faster) decay rate of arbitrary variance (see Corollary 2 and Remark 1). • In Section 4, we introduce the kernel embedding approximation of the linear response statistics. To simplify the discussion and provide a concrete error bound, we only present results based on the Mehler kernel. The same overall conclusion holds, with different constants in the error bound and a different weighted L²-space, if the same technique is applied to the Hille-Hardy kernel. Based on the RKHS induced by the Mehler kernel, we consider the kernel embedding estimates of the target equilibrium distribution function of an ergodic Itô diffusion. Consequently, we define the kernel embedding linear response operator as the FDT response operator associated with the kernel embedding density estimate.
We discuss to what extent the well-posedness of the estimator can be achieved by excluding the data points that admit negative-valued density estimates. Then, we summarize the consistency of the proposed estimator in the limit of M → ∞ in Proposition 2. Using the regularity of the functions in the RKHS induced by the Mehler kernel, we specify sufficient conditions on the external forces to admit a well-defined FDT response operator for the class of target functions in the RKHS (see Remark 2). We also provide a simple criterion for choosing the parameter M and other parameters for a well-defined estimator that avoids negative-valued density estimates. In Proposition 3, we provide an error bound for the Monte-Carlo approximation under non-i.i.d. data and discuss the implications of this result. • In Section 5, we numerically examine the kernel embedding linear response.
We choose two ergodic SDEs with known analytical equilibrium densities so that our approach can be directly validated. We will demonstrate the effectiveness of the Hermite and Laguerre polynomials in approximating densities with symmetric and non-symmetric decaying tails, respectively. Compared to the linear response estimator obtained via the classical KDE, we numerically find that the proposed estimator is computationally more efficient and more accurate. • In Section 6, we close this paper with a summary and discussion on open problems and future research plans that stem from this study.
Some proofs are reported in the Appendices to improve the readability.
2. Preliminary on the theory of linear response and RKHS. In this section, we first review the linear response theory, which motivates the estimation of the equilibrium density of SDEs from their time series. Then we include a short survey on RKHS and the relevant results. We refer readers to [33,36,48] for a more comprehensive discussion.
2.1. Linear response theory. Fluctuation-Dissipation Theory (FDT) is a mathematical framework for quantifying the linear response of a dynamical system subjected to small external forcing [28]. The linear response statistics, determined based on two-point equilibrium statistics of the unperturbed dynamics, provide estimates for the non-equilibrium properties. In the statistical mechanics literature, FDT is a linear response approach [14] which serves as a foundation for defining transport coefficients, e.g., viscosity, diffusion constant, heat conductivity, etc. The review will be presented in the context of d-dimensional (time-homogeneous) SDEs, also known as Itô diffusions [39]. The unperturbed and perturbed systems of SDEs are written, respectively, as follows,

    dX_t = b(X_t) dt + σ(X_t) dW_t,    (1)
    dX^δ_t = b(X^δ_t) dt + f(X^δ_t, t) dt + σ(X^δ_t) dU_t,    (2)

where W_t and U_t are standard Wiener processes. In both the unperturbed and perturbed systems in (1)-(2), the vector field b : R^d → R^d denotes the drift and σ : R^d → R^{d×d} denotes the diffusion tensor. In (2), an order-δ (δ ≪ 1) external perturbation is introduced, in the form of f(x, t) = c(x)δf(t).
We assume the unperturbed system governed by Eq. (1) is ergodic with a positive equilibrium density p_eq(x), where x ∈ R^d, as the unique classical solution of the stationary Fokker-Planck equation (see Theorem 4.1 of [39] for detailed conditions). We should clarify that while we refer to p_eq as the equilibrium density (following the classical nomenclature of linear response theory in statistical mechanics [26]), the formulation in this paper is not restricted to reversible processes. We also assume that the statistical properties associated with Eq. (2) can be characterized by a perturbed density p^δ(x, t), which solves the corresponding Fokker-Planck equation with initial condition p^δ(x, 0) = p_eq(x); that is, we initiate the perturbed dynamics in (2) at the equilibrium state of the unperturbed dynamics in (1). In the following formulation, we do not assume that (2) possesses a stationary distribution.
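For concreteness, systems of the form (1)-(2) can be simulated with a standard Euler-Maruyama discretization. The sketch below uses a hypothetical one-dimensional Ornstein-Uhlenbeck example, b(x) = −x, σ = √2 (so p_eq is N(0, 1)), with c(x) = 1 and δf(t) ≡ δ; the drift, step size, and horizon are assumptions made for illustration, not the systems studied later in the paper.

```python
import numpy as np

def euler_maruyama(drift, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama discretization of dX = drift(X) dt + sigma dW."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        x[n + 1] = x[n] + drift(x[n]) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

rng = np.random.default_rng(1)
dt, n_steps, delta = 1e-2, 20_000, 0.1
# unperturbed dynamics (1): dX = -X dt + sqrt(2) dW, equilibrium N(0, 1)
path0 = euler_maruyama(lambda x: -x, np.sqrt(2.0), 0.0, dt, n_steps, rng)
# perturbed dynamics (2): drift shifted by c(x) * delta * f(t) = delta
path1 = euler_maruyama(lambda x: -x + delta, np.sqrt(2.0), 0.0, dt, n_steps, rng)
# the long-time average shifts by roughly delta under this perturbation
print(path0[n_steps // 2:].mean(), path1[n_steps // 2:].mean())
```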
For any integrable observable A(·), we define the difference of the two expectations,

    δE[A](t) := E_{p^δ}[A(X^δ)](t) − E_{p_eq}[A(X)],    (3)

as the full response statistics. In Appendix A of [58], we have shown that computing the non-equilibrium statistics E_{p^δ}[A(X^δ)](t) in (3) requires extensive numerical simulations of the perturbed dynamics in (2). The linear response theory allows us to estimate the order-δ term of (3) by a convolution integral,

    δE[A](t) ≈ ∫_0^t k_A(t − s) δf(s) ds,    (4)

avoiding simulations of the perturbed dynamics in (2). Specifically, FDT formulates the linear response operator k_A(t) in (4) as the following two-point statistics:

    k_A(t) := E_{p_eq}[A(X_t) ⊗ B(X_0)],   B_i(x) := −∂_{x_i}(c_i(x) p_eq(x)) / p_eq(x),    (5)

where B_i and c_i denote the i-th components of B and c, respectively. In [14], the variable B is called the conjugate variable to the external forcing. In general, if A in Eq. (3) is an m-dimensional vector, then the linear response operator k_A(t) is an m-by-d matrix. The significance of FDT is that the response operator is defined without involving the perturbed density p^δ(x, t). Rather, it can be evaluated at the equilibrium of the unperturbed dynamics. For a given t ≥ 0, the value of k_A(t) can be computed using a Monte-Carlo sum based on the time series of the unperturbed system (1) at p_eq. For example, let {X_n = X(t_n)}_{n=1}^N be the time series generated at p_eq with step length ∆t = t_{n+1} − t_n; then, for t = s∆t, the Monte-Carlo approximation can be written as

    k_A(s∆t) ≈ (1/(N − s)) Σ_{n=1}^{N−s} A(X_{n+s}) ⊗ B(X_n).    (6)

In practice, the computation of (6) can be done efficiently using the block averaging algorithm [15]. In applications, the major issue comes from the conjugate variable B in the linear response operator (5). Since B depends on the explicit formula of p_eq, which may not be available, one cannot directly apply the Monte-Carlo approximation in (6) given only the time series {X_n} at p_eq. Thus, it is natural to ask how to learn the density function p_eq from the observed time series, such that the conjugate variable B can be determined via the estimated density, p̂_eq.
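The Monte-Carlo approximation (6) can be sketched as follows for a one-dimensional Ornstein-Uhlenbeck test case, dX = −X dt + √2 dW, for which p_eq = N(0, 1) is known; with c(x) = 1 the conjugate variable reduces to B(x) = −p_eq'(x)/p_eq(x) = x, and taking A(x) = x the exact response is k_A(t) = E[X_t X_0] = e^{−t}. All of these choices are illustrative assumptions, not the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, N = 1e-2, 500_000
# equilibrium time series of dX = -X dt + sqrt(2) dW via the exact OU update
a, s = np.exp(-dt), np.sqrt(1.0 - np.exp(-2.0 * dt))
X = np.empty(N)
X[0] = rng.normal()
xi = rng.normal(size=N - 1)
for n in range(N - 1):
    X[n + 1] = a * X[n] + s * xi[n]

A = X   # observable A(x) = x
B = X   # conjugate variable B(x) = x for c = 1, p_eq = N(0, 1)

def k_hat(lag):
    """Monte-Carlo approximation (6) of k_A(lag * dt) from one long series."""
    return np.mean(A[lag:] * B[: N - lag])

for lag in (0, 50, 100):
    print(k_hat(lag), np.exp(-lag * dt))  # estimate vs exact response e^{-t}
```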
To guarantee a well-posed estimation problem, the following lemma provides conditions on observables A and B such that the two-point statistics k A (t) in (5) is bounded ∀t ≥ 0.
Lemma 2.1. Let X be the solution of (1) initialized at the equilibrium. Assume that A and B in (5) have finite second moments with respect to p_eq; then the linear response operator k_A(t) in (5) is well-defined. In particular, we have

    |k_A(t)_{ij}| ≤ (E_{p_eq}[A_i²])^{1/2} (E_{p_eq}[B_j²])^{1/2},

where the inequality and the square/square-root operations are defined componentwise.
Proof. Let L be the generator of the Itô diffusion (1) [39], and let e^{tL} be the corresponding semigroup. Let us denote the transition kernel of (1) as p(x, t; y) := e^{tL*}δ(x − y), where L*, acting on x, is the adjoint operator of L in the standard L²(R^d) space. That is, the solution of the Fokker-Planck equation associated with (1) with initial condition δ(x − y) can be interpreted as a density function of x.
With these definitions, k_A(t) in (5) can be specified as a double integral [39],

    k_A(t)_{ij} = ∫∫ A_i(x) B_j(y) p(x, t; y) p_eq(y) dx dy = E_{p_eq}[A_i(X_t) B_j(X_0)].

Using the Cauchy-Schwarz inequality, we have

    |k_A(t)_{ij}| ≤ (E_{p_eq}[A_i²])^{1/2} (E_{p_eq}[B_j²])^{1/2}.    (7)

In Eq. (7), the inequality and identity are all defined componentwise.
We should point out that the validity of the linear response theory for SDEs has been discussed in [18] under a very general setting. Here, the purpose of Lemma 2.1 is to guarantee that the linear response operator in (5), which is the central object that we wish to approximate in this article, is bounded uniformly in time. The boundedness of the second moment of the observable A with respect to p_eq is fulfilled in many applications. As for the conjugate function B, since it is related to the function c(·) in the external forcing through the formula in (5), the lemma provides a condition for admissible external forcings. In Section 4, for a specific class of p_eq, we will provide a more concrete condition on c(·) such that B has a bounded second moment with respect to p_eq (see Remark 2).
It is worthwhile to mention that learning a distribution function from observations is a classical problem in statistics and machine learning. In general, there are two types of approaches: parametric and nonparametric. The kernel embedding formulation, which is the focus of this paper, belongs to the nonparametric category. However, direct use of the kernel embedding formulation with radial-type kernels, e.g., k(x, y) = h(‖x − y‖), to approximate p_eq is computationally expensive for this application. In particular, if the length of the training data is N, the Monte-Carlo integral in (6) will require computing the norm in the kernel function N(N − s) times, since the evaluation of B (as defined in (5)) on each sample point X_i requires the computation of h(‖X_i − X_j‖) for all j = 1, . . . , N. This computational cost is similar to the complexity of using the kernel density estimation with radial-type kernels. In addition, the kernel embedding expansion implemented with radial-type kernels will require an inversion of an N × N matrix (i.e., solving a large linear system) to determine the expansion coefficients. To overcome these practical issues, we will consider the Mercer-type kernels, constructed using a set of orthogonal polynomials. With such kernels, the resulting kernel embedding approximation of B (or, effectively, p_eq) becomes a parametric model, since the number of parameters is smaller than the size of the data. Furthermore, the expansion coefficients can be determined using only transpose operations (thus avoiding a large matrix inversion), thanks to the orthogonality.
To facilitate a self-contained discussion, we now provide a quick review of the relevant background material on RKHS.
2.2. Kernels and RKHS. In this subsection, we review the notions of kernel, feature space, feature map, and RKHS. Then we summarize a few useful results which will be used in the later proofs. All unlisted proofs can be found in Chapter 4 of [48]. We will restrict our discussions to real-valued functions since it provides simpler notations and is adequate for our application.
Definition 2.2. (kernel) Let X be a non-empty set. A function k : X × X → R is called a kernel if there exist an R-Hilbert space H and a map Φ : X → H such that

    k(x, y) = ⟨Φ(x), Φ(y)⟩_H,  ∀x, y ∈ X,    (8)

where ⟨·, ·⟩_H denotes the inner product of H. We call Φ a feature map and H a feature space of k.
By Eq. (8), for any fixed x_1, x_2, . . . , x_n ∈ X, the n × n Gram matrix

    K := (k(x_i, x_j))_{i,j=1}^n    (9)

is symmetric positive definite (SPD); that is, ∀a = (a_1, a_2, . . . , a_n) ∈ R^n, the bilinear form satisfies

    Σ_{i,j=1}^n a_i a_j k(x_i, x_j) = ‖Σ_{i=1}^n a_i Φ(x_i)‖²_H ≥ 0.

We say that a function k : X × X → R is SPD if, ∀n ≥ 1 and (x_1, x_2, . . . , x_n) ∈ X^n, the corresponding Gram matrix (9) is SPD. Here, we follow the convention in [48]. Our next lemma states that symmetric positive definiteness is not only a necessary, but also a sufficient condition for k to be a kernel.

Lemma 2.3. A function k : X × X → R is a kernel if and only if it is SPD.
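As a quick numerical check of the SPD property, one can assemble the Gram matrix (9) for a Gaussian kernel (a standard example of a kernel, chosen here purely for illustration) and verify that its eigenvalues are non-negative:

```python
import numpy as np

def gram(kernel, pts):
    """Gram matrix (9): K_ij = k(x_i, x_j) for a list of points."""
    return np.array([[kernel(x, y) for y in pts] for x in pts])

# Gaussian kernel with an arbitrary illustrative bandwidth parameter theta
k = lambda x, y, theta=0.5: np.exp(-theta * (x - y) ** 2)
pts = np.linspace(-2.0, 2.0, 8)
K = gram(k, pts)

assert np.allclose(K, K.T)          # symmetry
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-10)         # a^T K a >= 0 for all a, up to rounding
```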
The lemma above is useful for checking whether a given function is a kernel. With the concept of kernel, we now define the RKHS.
Definition 2.4. Let X be a non-empty set and H an R-Hilbert function space over X, i.e., an R-Hilbert space of functions mapping X to R. Then H is called an RKHS with kernel k if k(·, x) ∈ H, ∀x ∈ X, and the reproducing property

    f(x) = ⟨f, k(·, x)⟩_H    (10)

holds for all f ∈ H and all x ∈ X. In particular, we call such a k(·, ·) a reproducing kernel of H.
From Definition 2.4, it may seem that the reproducing kernel is merely a byproduct of the RKHS. However, there is a one-to-one correspondence between the RKHS and its kernel (see Theorems 4.20 and 4.21 of [48]). In Section 3, where we construct the RKHS for the estimation of the linear response, we will first define a kernel and then build an RKHS to "promote" this kernel to a reproducing kernel.
In the rest of this section, H always denotes the RKHS with kernel k. The RKHS has the remarkable property that norm convergence implies pointwise convergence. More precisely, consider f_n → f in H; then, by the reproducing property (10) and the Cauchy-Schwarz inequality,

    |f_n(x) − f(x)| = |⟨f_n − f, k(·, x)⟩_H| ≤ ‖f_n − f‖_H ‖k(·, x)‖_H → 0,    (11)

as n → ∞. Eq. (11) also suggests that if ‖k(·, x)‖_H is bounded uniformly in x ∈ X, we will have the uniform convergence of f_n to f. We arrive at the following lemma.
Lemma 2.5. Let X be a topological space and k be a kernel on X with RKHS H. If k is bounded in the sense that

    ‖k‖_∞ := sup_{x∈X} k^{1/2}(x, x) < ∞,

and k(·, x) : X → R is continuous ∀x ∈ X, then H ⊂ C_b(X) (the space of bounded and continuous functions on X), and the inclusion id : H → C_b(X) is continuous with ‖id‖ ≤ ‖k‖_∞.

Here, to see the connection between ‖k‖_∞ and ‖k(·, x)‖_H, simply notice that k(·, x) ∈ H, and with the reproducing property (10) we have ‖k(·, x)‖²_H = ⟨k(·, x), k(·, x)⟩_H = k(x, x). Consequently, for every f ∈ H,

    |f(x)| = |⟨f, k(·, x)⟩_H| ≤ ‖f‖_H k^{1/2}(x, x),    (12)

that is, f(x) has the same decay rate as k^{1/2}(x, x). In this paper, we focus on RKHSs with bounded kernels.
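The bound (12) can be verified numerically for a finite linear combination f = Σ_i a_i k(·, x_i), whose RKHS norm is ‖f‖_H = (aᵀKa)^{1/2} by the reproducing property. The Gaussian kernel, the points x_i, and the coefficients a_i below are arbitrary illustrative choices; note that ‖k‖_∞ = 1 for this kernel, so sup|f| ≤ ‖f‖_H.

```python
import numpy as np

# RKHS element f = sum_i a_i k(., x_i); its RKHS norm is sqrt(a^T K a)
theta = 0.5
k = lambda x, y: np.exp(-theta * (x - y) ** 2)
xi = np.array([-1.0, 0.3, 2.0])     # arbitrary centers
a = np.array([0.7, -1.2, 0.5])      # arbitrary coefficients
K = k(xi[:, None], xi[None, :])
f_norm_H = np.sqrt(a @ K @ a)

grid = np.linspace(-6.0, 6.0, 2001)
f_vals = (a[:, None] * k(xi[:, None], grid[None, :])).sum(axis=0)

# Eq. (12): sup |f(x)| <= ||f||_H * sup k(x, x)^{1/2} = ||f||_H here
print(np.max(np.abs(f_vals)) <= f_norm_H)   # True
```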
As a subspace of C_b(X), it is natural to ask whether the RKHS H is dense in the Banach space C_b(X) equipped with the uniform norm. The density of H in C_b(X) is equivalent to the c-universality [46] of the corresponding kernel k, which has been discussed by Steinwart [47] for compact X. In this paper, we are interested in the case where X is non-compact, e.g., X = R^d, and the target f is a continuous density function vanishing at infinity. For a locally compact Hausdorff (LCH) space X, let C_0(X) denote the space of all continuous functions on X that vanish at infinity; that is, ∀δ > 0, the set {x ∈ X | |f(x)| ≥ δ} is compact ∀f ∈ C_0(X). C_0(X), like C_b(X), is a Banach space with respect to the uniform norm ‖·‖_∞. As an analog of the c-universal kernel, the concept of c_0-universality was introduced by Sriperumbudur et al. in [46].

Definition 2.6. (c_0-universal) Let X be an LCH space and let k be a bounded kernel on X × X with k(·, x) ∈ C_0(X), ∀x ∈ X. The kernel k is said to be c_0-universal if the RKHS H induced by k is dense in C_0(X) with respect to the uniform norm; that is, ∀g ∈ C_0(X) and ∀ε > 0, there exists a function ĝ ∈ H such that ‖g − ĝ‖_∞ < ε.
A series of characterizations of c_0-universality for different types of kernels has been developed in [45,46] based on the Hahn-Banach theorem and the Riesz representation theorem. When X = R^d, a typical example of a c_0-universal kernel is the Gaussian kernel k(x, y) = exp(−θ‖x − y‖²), x, y ∈ R^d, for some θ > 0. To facilitate the applications in this paper (see Section 3), we generalize the concept of c_0-universality to a weighted C_0-space.

Lemma 2.7. Let X be an LCH space and q be a bounded positive continuous function on X. Then we have the following results.

The set of functions C_0(X, q^{−1}) := {f ∈ C(X) | fq^{−1} ∈ C_0(X)}, equipped with the norm ‖f‖_{C_0(q^{−1})} := ‖fq^{−1}‖_∞, is a Banach space.
Proof. It is straightforward to check that C_0(X, q^{−1}) is a normed vector space with respect to the norm ‖·‖_{C_0(q^{−1})}. To see that C_0(X, q^{−1}) is complete under the topology induced by ‖·‖_{C_0(q^{−1})}, we introduce the linear bijection Q : C_0(X) → C_0(X, q^{−1}) defined as Q(g) := gq for every g ∈ C_0(X). The bijection Q is norm-preserving in the sense that, ∀g ∈ C_0(X),

    ‖Q(g)‖_{C_0(q^{−1})} = ‖gq q^{−1}‖_∞ = ‖g‖_∞.

In practice, we use the weight function q to characterize the decay rate of continuous functions. For example, take X = R^d and q ∝ exp(−θ‖x‖²) for some θ > 0; then the functions in C_0(R^d, q^{−1}) are continuous with a Gaussian decay rate. Motivated by the decay rate (12) of functions in H, the following lemma provides conditions for H to be dense in C_0(X, q^{−1}).

Lemma 2.8. (weighted c_0-universal) Let X and q be the same as in Lemma 2.7, and let k be a kernel satisfying k(·, x) ∈ C_0(X, q^{−1}), ∀x ∈ X. Then k̃(x, y) := q^{−1}(x)k(x, y)q^{−1}(y) defines a kernel on X, and the RKHS H induced by k is dense in C_0(X, q^{−1}) if and only if the kernel k̃ is c_0-universal.
In terms of the reproducing property, ∀g = fq^{−1} ∈ H̃ with f ∈ H, we have

    g(x) = f(x)q^{−1}(x) = ⟨f, k(·, x)⟩_H q^{−1}(x) = ⟨g, k̃(·, x)⟩_{H̃},  ∀x ∈ X,

where the inner product on H̃ := {fq^{−1} | f ∈ H} is defined as ⟨g_1, g_2⟩_{H̃} := ⟨g_1 q, g_2 q⟩_H.

Thus, H̃ is indeed the RKHS induced by k̃, satisfying Q(H̃ ∩ C_0(X)) = H ∩ C_0(X, q^{−1}). Finally, by the fact that Q defines an isometric isomorphism between C_0(X) and C_0(X, q^{−1}), we reach the equivalence.
Our next lemma characterizes how the differentiability of a kernel is inherited by the functions of its RKHS. In particular, we take X ⊂ R^d to be an open subset, and introduce the multi-index notation m = (m_1, m_2, . . . , m_d) with m_i being non-negative integers. Then, we say that the kernel k is M-times continuously differentiable if ∂^{m,m} k : X × X → R exists and is continuous for all multi-indices m with ‖m‖_1 := m_1 + · · · + m_d ≤ M.

Lemma 2.9. Let X be an open subset of R^d, and let k be an M-times continuously differentiable kernel. Then every f ∈ H is M-times continuously differentiable in X, and ∀ ‖m‖_1 ≤ M, we have

    |∂^m f(x)| ≤ ‖f‖_H (∂^{m,m} k(x, x))^{1/2},  ∀x ∈ X.

From Lemmas 2.5 and 2.9, we have learned the importance of constructing a reproducing kernel k with certain "nice" properties. In practice, we can construct a bounded kernel using (8) based on a feature map Φ : X → H, where H is a Hilbert space (e.g., an ℓ²-space). Then, we define the corresponding RKHS H as a subspace of C_0(X) such that H yields the reproducing property. The coming section will follow this procedure of constructing an RKHS, using a feature map induced by orthogonal polynomials.
3. From orthogonal polynomials to RKHS. In Section 2, we discussed the density estimation problem arising from the linear response theory. Given an observed time series, our goal is to approximate the target equilibrium density function p_eq with appropriate RKHS functions. In practice, such an RKHS should be selected so that: 1. The corresponding reproducing kernel k has the properties in Lemmas 2.5 and 2.9, so that we can derive convergence results for the estimates. We will show examples of RKHSs constructed from Hermite and Laguerre polynomials (see Section 3.1). 2. The RKHS is rich enough, in the sense of Lemma 2.8, so that one can estimate a p_eq with a certain decay rate to arbitrary precision (see Section 3.2).
3.1. Constructing RKHS via orthogonal polynomials. Inspired by Mercer's theorem [48] and the reproducing kernel weighted Hilbert spaces used in [5,22,23,58], we consider the orthonormal polynomials with respect to L^2(R^d, W) to construct our kernel and RKHS, where W denotes a weight function on R^d. For a more concise discussion, we begin our analysis with the class of one-dimensional weight functions satisfying the conditions specified in the following lemma. In the lemma, the orthonormal polynomials are obtained by applying the Gram-Schmidt process to {1, x, x^2, ...} based on the inner product in L^2(R, W), and p_n always denotes a polynomial of degree n.
Then the orthonormal polynomials {p_n} in L^2(R, W) satisfy the uniform bound (13), where C_1 and C_2 are positive constants independent of n.
A typical class of functions satisfying the conditions in Lemma 3.1 is Q_m(x) = |x|^m for m > 1. Let W_m = e^{−2Q_m}; then (13) leads to the corresponding uniform bound for the associated orthonormal polynomials. With this control of the L^∞-norm of the orthonormal polynomials in (13), the following lemma defines the bounded kernel we need to build our RKHS.
Proof. Notice that by the uniform bound (13) in Lemma 3.1, combined with the condition (14), we arrive at the uniform convergence of the summation in (15). Thus, k_β(x, y) in (15) is a well-defined continuous function on R^2 with the decay rate (16). To show that k_β is a kernel, we define the feature map Φ_β : R → ℓ^2 componentwise in terms of the weighted orthonormal polynomials appearing in (15). With k_β(x, y) = ⟨Φ_β(x), Φ_β(y)⟩_{ℓ^2}, and by Definition 2.2, k_β is a kernel.
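To make the feature-map construction concrete, the following sketch builds a truncated version of the one-dimensional kernel in (15), under the illustrative assumptions that W is the standard Gaussian weight (so that the p_n are the normalized Hermite polynomials, cf. Section 3.1.1) and that λ_n = ρ^n for some ρ ∈ (0, 1); the function names are ours. Symmetry and the Cauchy-Schwarz inequality, both implied by the feature-map representation, can then be checked numerically.

```python
import math

def psi(n, x):
    """Orthonormal probabilists' Hermite polynomial psi_n = He_n / sqrt(n!),
    evaluated by the normalized three-term recurrence."""
    p_prev, p_curr = 0.0, 1.0
    for k in range(n):
        p_prev, p_curr = p_curr, (x * p_curr - math.sqrt(k) * p_prev) / math.sqrt(k + 1)
    return p_curr

def gauss_w(x):
    """Standard Gaussian weight W(x) (the assumed weight in this sketch)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def k_beta(x, y, beta=0.5, rho=0.5, n_terms=60):
    """Truncated kernel sum_n rho^n psi_n(x) psi_n(y) W(x)^beta W(y)^beta,
    i.e. <Phi(x), Phi(y)> for the feature map Phi(x) = (rho^{n/2} psi_n(x) W(x)^beta)_n."""
    s = sum(rho**n * psi(n, x) * psi(n, y) for n in range(n_terms))
    return s * gauss_w(x)**beta * gauss_w(y)**beta
```

Since the truncated sum is a finite-rank feature-map kernel, it is symmetric and positive semi-definite by construction for any truncation level.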

To generalize Lemma 3.2 to the d-dimensional case, consider the tensor-product orthonormal polynomials indexed by a multi-index m = (m_1, m_2, ..., m_d), where {p_n} are the orthonormal polynomials in L^2(R, W). Following Lemma 3.2, we define, for β ≥ 1/2, the kernel in Eq. (17) as the d-dimensional generalization of the kernel k_β in Eq. (15). Here, λ_n satisfies the condition in Lemma 3.2, and the function k_β(x, y) in (17) is again a well-defined kernel. Moreover, similar to (16), k_β(x, y) yields a decay rate with the same constant C_3 as in (16). Formally, one can always define a kernel by the infinite sum in (15). In Lemma 3.2, the uniform convergence of the series in (15) is proved based on the asymptotic behavior of a class of orthogonal polynomials (see Lemma 3.1). In practice, one can also rely on existing identities to show the boundedness of the kernel defined in (15) (see Section 3.1.2 below).
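The tensor-product structure can be sanity-checked numerically: when the multi-index eigenvalue factors as λ_m = Π_i λ_{m_i} (here taken as ρ^{m_1+...+m_d} with β = 1, illustrative assumptions of ours) and the truncation is a box over each index, the d-dimensional sum factors exactly into a product of one-dimensional sums. A minimal sketch in d = 2, with our own function names:

```python
import math

def psi(n, x):
    """Orthonormal probabilists' Hermite polynomial (normalized recurrence)."""
    p_prev, p_curr = 0.0, 1.0
    for k in range(n):
        p_prev, p_curr = p_curr, (x * p_curr - math.sqrt(k) * p_prev) / math.sqrt(k + 1)
    return p_curr

def w(x):
    """Standard Gaussian weight."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def k1(x, y, rho=0.4, M=25):
    """Box-truncated one-dimensional kernel with lambda_n = rho^n (illustrative)."""
    return sum(rho**m * psi(m, x) * psi(m, y) for m in range(M + 1)) * w(x) * w(y)

def k2(x, y, rho=0.4, M=25):
    """Box-truncated two-dimensional tensor-product kernel over indices m_i <= M."""
    s = 0.0
    for m1 in range(M + 1):
        for m2 in range(M + 1):
            s += (rho**(m1 + m2) * psi(m1, x[0]) * psi(m2, x[1])
                  * psi(m1, y[0]) * psi(m2, y[1]))
    return s * w(x[0]) * w(x[1]) * w(y[0]) * w(y[1])
```

Because the eigenvalues, basis functions, and weights all factor over coordinates, the double sum separates: k2(x, y) = k1(x_1, y_1) · k1(x_2, y_2) exactly.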
Our next task is to define the RKHS, denoted by H β , such that the bounded kernel k β in (17) is the reproducing kernel of H β . A crucial observation is that, by Definition 2.4, a necessary condition for H β is k β (·, x) ∈ H β , ∀x ∈ R d . Thus, we shall first study the function space containing k β (·, x), x ∈ R d . We have the following lemma.
In particular, by the orthogonality of {Ψ_{β,m}}, we obtain the corresponding expansion coefficients. We shall emphasize that L^2(R^d, W^{1−2β}) ∩ C_0(R^d) is the set consisting of all continuous functions vanishing at infinity which are W^{1−2β}-weighted L^2-integrable, while each element of L^2(R^d, W^{1−2β}) represents an equivalence class of functions, due to the difference of the topologies. With the topology induced by the weighted L^2-norm, the expansion (20) is valid in the L^2(R^d, W^{1−2β}) sense, which is relatively weak since most of the properties of f_M no longer persist after passing through the limit. The following proposition, as the main result of Section 3.1, defines the RKHS induced by the kernel k_β in (17), such that ∀f ∈ H_β, the expansion (20) holds pointwise and f_M converges uniformly to f in C_0(R^d).
Proposition 1. Let W(x), p_n(x), and {λ_n} be as in Lemma 3.2. Then, for any fixed β ≥ 1/2, we have the following results. i. For any sequence {f_m} ∈ ℓ^2 satisfying (21), where λ_m is defined in (17), the sequence of functions f_M converges uniformly in C_0(R^d); moreover, the limit, denoted as f^*, satisfies (22). ii. H_β is a well-defined subspace of L^2(R^d, W^{1−2β}) ∩ C_0(R^d); ⟨·,·⟩ defines an inner product, and H_β, equipped with the inner product ⟨·,·⟩, is a Hilbert space. iii. H_β is the RKHS with reproducing kernel k_β in (17).
See Appendix A for the proofs. With k_β being bounded, Lemma 2.5 implies that norm convergence in H_β is a sufficient condition for uniform convergence in C_0(R^d). In particular, ∀f ∈ H_β, the expansion formula (22) converges uniformly; that is, there is a one-to-one correspondence between the function f ∈ H_β and the sequence of expansion coefficients {f_m} ∈ ℓ^2 satisfying (21). Since the condition (21) is independent of β, the class of RKHSs {H_β} defined in Proposition 1 are isometrically isomorphic to each other. In particular, ∀β_1, β_2 ≥ 1/2, the linear map I_{β_1,β_2} : H_{β_1} → H_{β_2}, which maps a function in H_{β_1} to the function in H_{β_2} with the same expansion coefficients, defines the isometric isomorphism between H_{β_1} and H_{β_2}, with I_{β_1,β_2}^{-1} = I_{β_2,β_1}. To connect the kernel k_β in (17) with Mercer's theorem, we define an integral operator T_{k_β} on L^2(R^d, W^{1−2β}) as in (23). Setting f = Ψ_{β,n} in (23), we obtain T_{k_β} Ψ_{β,n} = λ_n Ψ_{β,n}, where we have used the fact that the convergence in (17) is uniform to switch the order of integration and summation. Thus, λ_m and Ψ_{β,m} correspond to the eigenvalues and eigenfunctions of T_{k_β}, respectively. We have introduced a framework to construct an RKHS as a subspace of L^2(R^d, W^{1−2β}) ∩ C_0(R^d). The resulting RKHS extracts features of both L^2(R^d, W^{1−2β}) and C_0(R^d); e.g., the expansion formula (20) makes sense not only in L^2(R^d, W^{1−2β}) but also pointwise. The main advantage of the representation in (22) over a radial-type kernel is as follows. In practice, given data {X_n}_{n=1}^N sampled from the target density f, we will choose a function in H_β with a finite sum over ‖m‖_1 ≤ M, where M ≪ N, as an estimator for f. While the choice of M allows us to specify the theoretical "bias" or "approximation error", thanks to the orthogonal representation (see Section 4), the resulting hypothesis function is parametric, and the evaluation of f at a new x ∈ R^d amounts to evaluating the \binom{M+d}{M} components of {Ψ_{β,m}(x) | ‖m‖_1 ≤ M}.
This is computationally much cheaper than evaluating f(x) = ⟨f, k(·, x)⟩_H with radial-type kernels, such as k(x, y) = h(‖x − y‖) for some positive function h, since the computation of the inner product requires evaluating h(‖X_n − x‖) for all n = 1, ..., N.
It is worthwhile to mention that although Proposition 1 relies on the weight function W satisfying the condition in Lemma 3.1, there are kernels and RKHSs that are generated from different choices of W. For example, when W has a compact support, the resulting kernel and RKHS are an immediate consequence of Mercer's theorem. In the remainder of this subsection, we discuss two examples, the classes of Mercer-type kernels based on the Hermite and Laguerre polynomials. In particular, for the Laguerre polynomials, the weight function does not satisfy the assumption in Lemma 3.1, and we use a different approach to justify the boundedness of the kernel.
3.1.1. Hermite polynomials. Let W be the d-dimensional standard Gaussian density, that is, W(x) ∝ exp(−‖x‖^2/2). As a result, the corresponding orthonormal polynomials are the normalized Hermite polynomials, denoted by {ψ_n}.
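In this Gaussian case the series defining the kernel admits a closed form via Mehler's summation formula, Σ_{n≥0} ρ^n ψ_n(x) ψ_n(y) = (1 − ρ^2)^{−1/2} exp(−(ρ^2(x^2 + y^2) − 2ρxy)/(2(1 − ρ^2))), for the orthonormal probabilists' Hermite polynomials ψ_n. A short numerical check of this identity (truncation level and test points are arbitrary choices of ours):

```python
import math

def psi(n, x):
    """Orthonormal probabilists' Hermite polynomial psi_n = He_n / sqrt(n!)."""
    p_prev, p_curr = 0.0, 1.0
    for k in range(n):
        p_prev, p_curr = p_curr, (x * p_curr - math.sqrt(k) * p_prev) / math.sqrt(k + 1)
    return p_curr

def mehler_series(x, y, rho, n_terms=80):
    """Truncated expansion sum_n rho^n psi_n(x) psi_n(y)."""
    return sum(rho**n * psi(n, x) * psi(n, y) for n in range(n_terms))

def mehler_closed(x, y, rho):
    """Closed form (1 - rho^2)^{-1/2} exp(-(rho^2 (x^2+y^2) - 2 rho x y) / (2 (1 - rho^2)))."""
    num = rho * rho * (x * x + y * y) - 2.0 * rho * x * y
    return math.exp(-num / (2.0 * (1.0 - rho * rho))) / math.sqrt(1.0 - rho * rho)
```

For |ρ| < 1 the truncated series converges geometrically to the closed form, which is what makes the Mehler kernel convenient for the universality analysis in Section 3.2.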
Notice that the Laguerre weight function G(x, θ) does not satisfy the hypothesis of Lemma 3.1, so we cannot use Lemma 3.2 to verify the boundedness of the Mercer-type kernel as in Eq. (24). Nevertheless, we can still verify the boundedness of the kernel using existing identities. We summarize the result as follows.
See Appendix B for the proof. The Hille-Hardy formula (Eq. (33) with d = 1) can be interpreted as a generalization of the Mehler kernel (25), since Hermite polynomials can be entirely deduced from the Laguerre polynomials [50].
With Lemma 3.4, following Proposition 1, one can construct the RKHS corresponding to the Hille-Hardy kernel in (33) as an analogue of the Mehler RKHS in (27). One can also specify the regularity of this function space, as in Lemma 2.9, by checking the boundedness of the derivatives of the Hille-Hardy kernel using some results in [1], which we omit here for brevity.

3.2. Universality of the Mehler RKHS. As mentioned in the introduction, for a reliable estimation, we would like to construct a hypothesis space (RKHS) that is "rich" enough to capture the particular behavior of the target function. Specifically, we will use the notion of weighted c_0-universality described in Lemma 2.8 to specify the appropriate Mehler RKHS that can capture the decay property (Gaussian or faster) of the target density function. The key idea is to understand the "richness" of the RKHS H_{β,ρ}, as we summarize now.
Proof. By Lemma 2.8, we need to check the following two conditions.
Notice that for any fixed y ∈ R^d, k_{β,ρ}(·, y) q_{β,ρ}^{-1}(·) ∈ C_0(R^d), as a function of x, since ρ ∈ (0, 1). By the definition of C_0(R^d, q_{β,ρ}^{-1}) (see Lemma 2.7), the first condition holds. For the second condition, simply notice that k̃(x, y) is proportional to exp(−ρ‖x − y‖^2/(2(1 − ρ^2))), and ρ/(2(1 − ρ^2)) > 0; that is, k̃(x, y) is a Gaussian kernel, which is c_0-universal. Remark 1. Corollary 2 suggests that the Mehler RKHS H_{β,ρ} is rich enough to approximate the functions in C_0(R^d, q_{β,ρ}^{-1}). Moreover, notice that the formula of q_{β,ρ} in (34) gives an explicit Gaussian decay rate. Thus, for any continuous target density function f with a Gaussian decay rate, that is, such that there exist constants θ_f, C_f > 0 satisfying (35), we can find β^* ≥ 1/2 and ρ^* ∈ (0, 1) such that (36) holds, and the decay rate (35) implies f ∈ C_0(R^d, q_{β^*,ρ^*}^{-1}). This means that one can approximate any continuous function with a Gaussian (or faster) decay rate as in (35), up to any desirable accuracy, using an estimator that belongs to the Mehler RKHS H_{β^*,ρ^*}, which is a space of functions with a Gaussian decay rate slower than (35). In practice, one can use the sample excess kurtosis on each coordinate to approximate the decay rate of the target density from the given samples, and thereby identify the value of θ_f in (35). In terms of tuning the parameters ρ and β via (36), since the left-hand side of Eq. (36) depends on (ρ, β), with (β − 1)/2 being the leading-order term, we fix ρ = ρ^* ∈ (0, 1) and seek β = β^* ≥ 1/2 that satisfies the inequality (36). Theoretically, we can tune the parameter ρ as well by further imposing a decay-rate condition on the partial derivatives of the density function f (e.g., Eq. (28)). However, such partial derivative data may not be available in practice.
We should point out that Corollary 2 relies on the analytical expression of the Mehler kernel, which allows us to find an isometrically isomorphic map that transforms the Mehler kernel into a radial kernel known to be c_0-universal. Unfortunately, such a technique is not trivially applicable to the Hille-Hardy kernel, since its analytical expression, as shown in (33), is more complicated. One possible approach to justify the "richness" of the Hille-Hardy RKHS is to develop a weighted c_0-universality characterization based on Proposition 12 in [46], and we leave this as an open problem. Practically, one can choose the appropriate basis guided by the shape parameter θ_i in (30), which can be determined from the sample data by the standard MLE method on each coordinate.

4. Kernel embedding linear response. In this section, we first define the kernel embedding estimate for density functions. Subsequently, we introduce the kernel embedding linear response as the main application. As mentioned in the introduction, a finite orthogonal polynomial expansion may admit negative values in the density estimation. Here, we will address this issue and discuss to what extent the consistency of the kernel embedding linear response can be achieved with the proposed algorithm. We will close this section with an error bound for the empirical kernel embedding estimates based on non-i.i.d. data. For simplicity, we only discuss the linear response estimate based on the Mehler RKHS hypothesis space. Similar techniques can be applied to the Hille-Hardy kernel.
We assume the target d-dimensional density function f lives in the Mehler RKHS H_{β,ρ} for some ρ ∈ (0, 1) and β ≥ 1/2. Recall that, by the reproducing property and the expansion formula (20), we have the representation (37), where W(x) ∝ exp(−‖x‖^2/2) and {ψ_{m_i}} are the normalized Hermite polynomials. We define the order-M kernel embedding estimate of f, denoted by f_M, as the order-M truncation (38) of Eq. (37). We should point out that, with this choice of basis representation, we arrive at a polynomial chaos approximation [56] of f, but the convergence f_M → f as M → ∞ is valid in both L^2(R^d, W^{1−2β}) and C_0(R^d). Another remark is that, although the formula of the kernel embedding estimate (38) is independent of the parameter ρ, this parameter does affect the rate of convergence of the residual error ‖f − f_M‖_{H_{β,ρ}} (see Proposition 3). In practice, the integral in (37) can be approximated by a Monte-Carlo average, yielding the empirical estimate in (39), where {X_n}_{n=1}^N are sampled from the target density function f; the decay rate of f in (29) ensures that these Monte-Carlo averages are well-defined. In our application, the target function is f = p_eq. Here, the sample {X_n} corresponds to a time series, {X(t_n)}, of the unperturbed dynamics (1) generated at the equilibrium. We should point out that the error bound, to be discussed in Proposition 3, will account for the fact that the empirical estimator in (39) is based on non-i.i.d. data. Let us replace the unknown p_eq in (5) by its order-M kernel embedding estimate, denoted as p_M. We can naively define the order-M kernel embedding linear response operator as in (41), for an integrable observable A(X) and external forcing c(X)δf(t). Unfortunately, Eq. (41) is not a well-defined estimator for the linear response operator k_A(t), since the basis function Ψ_{β,m}(x) in (24), as a product of normalized Hermite polynomials and the standard Gaussian density, can be negative.
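In one dimension with β = 1, this construction reduces to a Hermite series with Monte-Carlo coefficients c_m = (1/N) Σ_n ψ_m(X_n), and the density estimate is W(x) Σ_{m≤M} c_m ψ_m(x). A self-contained sketch (sample size, truncation order, and the standard-normal test target are our illustrative choices):

```python
import math, random

def psi(n, x):
    """Orthonormal probabilists' Hermite polynomial psi_n = He_n / sqrt(n!)."""
    p_prev, p_curr = 0.0, 1.0
    for k in range(n):
        p_prev, p_curr = p_curr, (x * p_curr - math.sqrt(k) * p_prev) / math.sqrt(k + 1)
    return p_curr

def gauss_w(x):
    """Standard Gaussian weight W(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def embedding_coeffs(samples, M):
    """Monte-Carlo estimates of the expansion coefficients c_m = E_f[psi_m(X)]."""
    N = len(samples)
    return [sum(psi(m, s) for s in samples) / N for m in range(M + 1)]

def density_estimate(x, coeffs):
    """Order-M kernel embedding estimate f_M(x) = W(x) * sum_m c_m psi_m(x)."""
    return gauss_w(x) * sum(c * psi(m, x) for m, c in enumerate(coeffs))

# Illustrative target: the standard normal, for which c_0 = 1 and c_m ~ 0, m >= 1.
random.seed(7)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]
coeffs = embedding_coeffs(samples, M=8)
fhat_half = density_estimate(0.5, coeffs)
```

Since ψ_0 ≡ 1, the zeroth coefficient is exactly one for any sample, and the higher coefficients carry the Monte-Carlo fluctuations of size O(N^{-1/2}).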
To overcome this issue, we introduce the following restriction of the order-M density estimate, where D_{M,δ} := {x ∈ R^d | p_M(x) ≥ δ} and χ_{D_{M,δ}} denotes the characteristic function of D_{M,δ}. For the analysis below, it is helpful to note the relation (42). Under the decay rate assumption on p_eq, the set {x ∈ R^d | p_eq(x) ≥ δ} is compact ∀δ > 0. Indeed, with a positive lower bound on p_M, the estimator in (41) is well-defined on D_{M,δ}. Here, the choice of δ > 0 depends on the desired precision of the estimator for a fixed M, as we shall explain next.
Similar to the correction of the density estimate, we define the restricted linear response operator k_{A,D_{M,δ}}(t) in (43). Further, we assume that both A(·) and B(·) are fixed and have finite second moments with respect to p_eq; then B_{D_{M,δ}} in (43) also has a finite second moment. Thus, according to Lemma 2.1, k_{A,D_{M,δ}}(t) is well-defined ∀t ≥ 0, and, replacing B by B(1 − χ_{D_{M,δ}}) in (7), we reach the bound (44), ∀t ≥ 0. We claim that ∀ε > 0, there exists a δ^* > 0 such that ∀δ ∈ (0, δ^*) and for M large enough, the inequality (45) holds, where the inequality is defined componentwise and uniformly in t. To clarify, without loss of generality, we assume that both A and B are scalars. With E_{p_eq}[B^2(X)] < ∞, by the continuity of the integral, there exists a compact set D ⊂ supp{p_eq} such that the contribution of D^c to the second moment of B is below the desired tolerance, where D^c = R^d \ D denotes the complement of D. Take δ^* := min{p_eq(x) | x ∈ D}. As a result of the uniform convergence of p_M to p_eq, we have, ∀δ ∈ (0, δ^*) and M large enough, δ + ‖p_eq − p_M‖_∞ ≤ δ^*; that is, D ⊂ D_{M,δ}, where we have used the relation in (42). Thus, the truncated contribution over D_{M,δ}^c is bounded by that over D^c, and, following Eq. (44), Eq. (45) holds for all t ≥ 0.
The implication of (45) is the existence of the domain D_{M,δ} such that for large enough M, the restricted linear response estimator in (43) is consistent with k_A(t) up to the desired precision ε > 0. Numerically, one can consistently estimate the restricted linear response operator in (43) with the well-defined estimator defined in (46). The following proposition characterizes the consistency of this estimator as M → ∞.
Proposition 2. Consider a d-dimensional Itô diffusion (1) with a positive equilibrium density function p_eq ∈ H_{β,ρ}, and its perturbed dynamics (2). Assume that both the observable A(·) and the conjugate variable B(·) in (5) have finite second moments with respect to p_eq. For the restricted linear response operator k_{A,D_{M,δ}}(t) in (43) and its estimator in (46), we have the convergence (47), uniformly in t. Here, ‖·‖_F denotes the Frobenius norm of matrices.
Proof. To simplify the discussion, without loss of generality, we assume that c(x) in the perturbed dynamics (2) is a scalar function. As a result, following Eq. (5), B takes the form in (48). By the integrability assumptions on A and B, both k_{A,D_{M,δ}}(t) and its estimator in (46) are well-defined, and we have the error decomposition (49). Furthermore, we have the bound (50), where we have used the fact that δ is the lower bound of p_M on D_{M,δ}. By Lemma 2.5 and Corollary 1, we have the uniform convergence of p_M → p_eq and ∇p_M → ∇p_eq as M → ∞, which yields (51), where we used the compactness of {x ∈ R^d | p_eq(x) ≥ δ/2} to pass to the limit. The first term also vanishes as M → ∞; here, we used the assumption that B has a finite second moment with respect to p_eq. Combining (50) with (51), we reach the convergence of the estimator of B. Together with Eq. (49) and the square integrability of A, Eq. (47) holds, and the convergence is uniform in t.

Remark 2.
It is worthwhile to mention that, for p_eq ∈ H_{β,ρ}, by the decay rates of p_eq (29) and ∇p_eq (28), a sufficient condition for B in (48) to have a finite second moment is that c(·) is continuously differentiable and satisfies an appropriate growth condition. This is a concrete condition that can be checked in practice to verify whether the FDT response is well-defined under external forces of the form f(x, t) = c(x)δf(t), for unperturbed dynamics whose equilibrium density has a Gaussian (or faster) decay rate of arbitrary variance.
In practice, we do not have to determine the domain D_{M,δ} explicitly. When approximating the estimator in (46) via the Monte-Carlo method (Eq. (6)), it is enough to truncate those samples X_n where p_{M,N}(X_n) < δ, with p_{M,N} given by (40). Numerically, given a training data set of length N, there are two parameters to be determined: M, the order of the estimator, and δ, the threshold of the truncation. For each M, to make full use of the data, we set δ = δ_M as in (52). Besides δ_M (52), we further introduce R_M in (53) and η_M in (54), where p_{M,N} and p̂_{m,N} are given by Eqs. (40) and (39) with f = p_eq, respectively. The parameter R_M in (53) characterizes the fraction of training samples whose estimated density values are smaller than the threshold δ_M; we refer to this parameter as the rejection probability. A small value of R_M corresponds to a small truncation error (ε in (45)). The parameter η_M in (54) provides an indicator of whether the estimator converges as M → ∞. In practice, the computation of η_M can be done in an iterative manner by increasing the parameter M, since the orthogonality condition allows one to identify the error through the expansion coefficients p̂_{m,N}, which can be computed consecutively. For accurate estimations, we will choose M such that both R_M and η_M are small. We summarize the numerical scheme into a pseudo-algorithm. Given a time series {X_i} sampled from the target unknown d-dimensional density function p_eq, and a target two-point statistic k_A(t) in (5), we proceed with the estimation in the following steps: 1. For each dimension, run a statistical test to identify the tail of the marginal distribution of p_eq. Subsequently, choose an appropriate RKHS basis based on the tail information (e.g., tuning the parameters (β, ρ) in the Mehler RKHS H_{β,ρ} in (27); see Remark 1 for the details); 2. Use the tensor product to construct the basis on R^d, and compute the empirical kernel embedding estimate of p_eq, p_{M,N}, following (39) and (40) with f = p_eq; 3. Use δ_M (52), R_M (53), and η_M (54) to identify the proper order of the estimate, and truncate sample points with estimated value p_{M,N}(X_i) less than δ; 4. Replace the unknown function B in the two-point statistics k_A(t) (5) with its estimate in (41), and compute the value via Monte-Carlo based on the truncated data set. In Figure 1, we provide a numerical example as an illustration. In this example, the data points are the time series of the solutions of the dynamical system in Section 5.1. Here, we illustrate the effectiveness of the criteria in (53)-(54) for choosing the parameter M. First, notice that δ_M specified as in (52) is roughly constant as a function of M, with values fluctuating around the floating-point single precision, 10^{-8}. To guarantee a well-defined estimator, one may need to set δ to be slightly larger than the floating-point single precision, since (52) can be arbitrarily small. In this example, the choice of M = 60 corresponds to δ_M > 10^{-7}, which is larger than the floating-point single precision. Notice that while the value of η_M fluctuates, it eventually converges for M > 50. On the same panel, we also plot the rejection probability R_M, which is smaller than 0.4% as M increases. For 20 < M < 50, the large value of R_M is due to the Gibbs phenomenon (as the estimate becomes oscillatory near zeros). As we shall see in Section 5.1, we will obtain a reasonably accurate estimation of k_A with M = 60. We will also use the same criterion to choose M in the other numerical example in Section 5.2.
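Steps 3-4 of the pseudo-algorithm only require the estimated density values at the samples: given a threshold δ, one computes the rejection probability as the rejected fraction and forms the Monte-Carlo average over the accepted samples. A minimal sketch with hypothetical helper names of ours:

```python
def rejection_stats(density_values, delta):
    """Rejection probability R: fraction of samples whose estimated density
    p_{M,N}(X_n) falls below the threshold delta, plus the accepted indices."""
    accepted = [i for i, p in enumerate(density_values) if p >= delta]
    return 1.0 - len(accepted) / len(density_values), accepted

def truncated_average(values, accepted):
    """Monte-Carlo average restricted to the accepted (non-truncated) samples."""
    return sum(values[i] for i in accepted) / len(accepted)

# Toy input: hypothetical density estimates at four samples, threshold 1e-7.
p_at_samples = [0.2, 1e-9, 0.5, 0.05]
R, acc = rejection_stats(p_at_samples, 1e-7)
```

In the full algorithm, `truncated_average` would be applied to the integrand of the two-point statistic, so that samples where the density estimate is unreliably small never enter the division by p_{M,N}.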
The following proposition addresses the Monte-Carlo error of the order-M empirical kernel embedding estimate f_{M,N} in (40).
Proposition 3. Given a d-dimensional stationary (time-homogeneous) Markov process X(t) with a discretization {X_n := X((n − 1)Δt)}_{n=1}^N satisfying: 1. {X_n} yields a stationary distribution f ∈ H_{β,ρ} for some ρ ∈ (0, 1) and β ∈ [1/2, 1/(1+ρ)]; 2. {X_n} is an α-mixing process [11] with mixing coefficients α(k) satisfying (55) for some exponent in (0, 2]. Then, for d ≤ (3/2)M + 1, we have the error bound (56) for the empirical kernel embedding estimate f_{M,N} in (40), with constants depending on ‖f‖_{H_{β,ρ}}, where the expectation in (56) is defined over the random variables {X_n}_{n=1}^N. See Appendix C for the proof. The first part of the upper bound in (56), with factor 1/N, corresponds to the estimation (or Monte-Carlo) error, which consists of the variance and covariance terms. The covariance term accounts for the error due to averaging over correlated, non-i.i.d. samples. The second part, depending on the parameter ρ ∈ (0, 1), corresponds to the approximation error, or bias, due to the finite truncation M. Asymptotically, it is clear that the error vanishes as M → ∞ after N → ∞.
If {X_n}_{n=1}^N are i.i.d. samples of f, the error bound (56) reduces to (57), where we are left with the variance and bias terms in the upper bound. We should point out that, in this case, the assumption d ≤ (3/2)M + 1, which is used to obtain the upper bound for the non-i.i.d. estimation error in (56), can be neglected. First, we note that the error estimate shows that the approximation error (or bias) is independent of the dimension. In fact, the approximation error in (57) decays exponentially as a function of the truncation order M, since 0 < ρ < 1. This fact is not so surprising, since the target function f ∈ H_{β,ρ} is bounded, smooth, and has Gaussian decay. However, the estimation error depends exponentially on the dimension. This curse of dimensionality is due to the use of the tensor product of an orthonormal basis. While the constants in the error bound may be different, the estimation with the Hille-Hardy RKHS should suffer from the same curse of dimensionality, and the method of proof is no different from the one presented in Appendix C.
For data points that lie on an m-dimensional compact manifold M embedded in R^d, it was shown in [22] that the estimation error (or variance) can be improved to exactly M N^{-1}, using the Mercer kernel constructed from the eigenvalues and corresponding orthonormal eigenfunctions of the Laplace-Beltrami (or a weighted Laplacian) operator, defined on functions that take values on M. In fact, if the target function is in a Sobolev class H^s(M), one can show that the approximation error is of order M^{-2s/m}, where the Weyl asymptotics for the eigenvalues of the Laplace-Beltrami (or a weighted Laplacian) operator on compact manifolds [10] are used. Balancing these errors, we obtain the famous optimal bias-variance trade-off rate of order N^{-2s/(2s+m)} for linear estimators [49]. When the target function is smooth, such as s = m, the optimal error rate of order N^{-2/3} suggests that this approach overcomes the curse of dimension. While this is appealing, numerical methods for approximating the orthonormal eigenfunctions of the Laplace-Beltrami operator, such as the diffusion maps algorithm [9], suffer from the curse of dimension, with a spectral convergence rate of order (log N / N)^{1/(2m)} [16]. Thus, the curse of dimensionality is not completely avoided.
The only advantage of such nonparametric estimation is when the target function is supported on a low-dimensional intrinsic manifold M that is embedded in a very high-dimensional ambient space R d .
In the next section, we will numerically verify the kernel embedding linear response estimators, constructed using the Mehler RKHS and the Hille-Hardy RKHS. For applications with target functions defined on smooth manifolds embedded in R^d, see [5,22].

5. Numerical examples. In this section, we will test the proposed method on two models, a stochastic gradient system with a triple-well potential (Section 5.1) and a Langevin equation with the Morse potential (Section 5.2). These two examples were explored in [57]. Here, our goal is to demonstrate the numerical implementation of the kernel embedding linear response approach introduced in Section 4 based on different choices of orthogonal polynomials. We also stress that we picked examples with explicit equilibrium distributions, so that we can verify the estimates by directly comparing to the exact linear response via the FDT formula, as reviewed in Section 2.1.
For comparison, we consider the classical kernel density estimator (KDE). Formally, given {X_n}_{n=1}^N sampled from a target density function p_eq, the KDE estimator is given by (58), where h > 0 is the bandwidth parameter and K is a smooth density function with zero mean and finite second moment [54]. The uniform convergence of the KDE estimates and of their derivatives has been discussed in [37] and [42], respectively. Similar to the kernel embedding linear response in Section 4, one may replace p_eq in (5) with the corresponding KDE estimate and obtain the KDE-based approximation of the linear response statistics. As opposed to the kernel embedding formulation in Section 4, the KDE-based estimator preserves positivity but lacks the orthogonality of the basis functions. As a result, we do not need to impose any truncation on the sample points in the numerical implementation. However, each evaluation of the KDE estimate or its derivatives requires evaluating K a total of N times, which makes the computation expensive.
As with any kernel-based estimator such as (58), the choice of the kernel K is not crucial, but the choice of the bandwidth h is important. For simplicity, throughout this section, we fix K to be the Gaussian kernel, K(x) = (2π)^{-d/2} exp(−‖x‖^2/2), and set h to be Silverman's bandwidth [43], h = (4/((d+2)N))^{1/(d+4)} σ({X_n}), where σ({X_n}) denotes the standard deviation of the given sample {X_n}_{n=1}^N. While more elaborate bandwidth selection methods are available (such as cross validation), we do not pursue them due to the long computation time of the linear response statistics (see Table 1 for the wall-clock time in both examples below with N = 10^7).
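A one-dimensional sketch of this KDE baseline with the Gaussian kernel and Silverman's rule of thumb, h = σ(4/((d+2)N))^{1/(d+4)} with d = 1 (sample size and the standard-normal test data are our choices):

```python
import math, random

def silverman_bandwidth(samples):
    """Silverman's rule of thumb in one dimension: h = sigma * (4 / (3 N))^{1/5}."""
    N = len(samples)
    mean = sum(samples) / N
    sigma = math.sqrt(sum((s - mean)**2 for s in samples) / (N - 1))
    return sigma * (4.0 / (3.0 * N))**0.2

def kde(x, samples, h):
    """Gaussian KDE: average of N Gaussian bumps of width h centered at the samples."""
    c = 1.0 / (math.sqrt(2.0 * math.pi) * h * len(samples))
    return c * sum(math.exp(-0.5 * ((x - s) / h)**2) for s in samples)

random.seed(11)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
h = silverman_bandwidth(data)
p0 = kde(0.0, data, h)   # should be close to the standard normal density at 0
```

The inner sum over all N samples in `kde` is exactly the per-evaluation cost that the text contrasts with the parametric kernel embedding estimator.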
For the triple-well model, the potential function contains a quadratic retaining potential, which introduces a Gaussian tail to the density function. Thus, we will consider a two-dimensional Mehler kernel in the kernel embedding linear response. For the Langevin equation, the marginal distribution of the velocity v is Gaussian, while the marginal distribution of the displacement x, governed by the Morse potential, is asymmetric. In computing the kernel embedding linear response, to obtain the best result, the kernel will be derived based on a tensor product of Hermite (for the variable v) and Laguerre (for the variable x) polynomials.

5.1. A gradient system with a triple-well potential. We first consider the two-dimensional stochastic gradient system (59), where W_t is a two-dimensional Wiener process and V, similar to the model in [20], is the triple-well potential (60), with v(z) a smooth bump function of amplitude 10, where χ_{(−a,a)} denotes the characteristic function over the interval (−a, a). The parameter γ ∈ (0, 1) in (60) indicates the depth of the three wells. The additional quadratic term 0.8[(x_1 − a)^2 + (x_2 − a/√3)^2] in the triple-well potential (60) is a smooth retaining potential (also known as a confining potential). The triple-well model (59), as a stochastic gradient system [39], yields the equilibrium distribution given in (61). To derive a linear response operator, we consider an external forcing that is constant in x. Subsequently, the perturbed dynamics follows, where |δ| ≪ 1. If we select A(x) := x as the observable, the corresponding linear response operator, as a 2-by-2 matrix-valued function given by (5), reads as in (62). For the quadratic retaining potential in (60), p_eq in (61) yields a Gaussian tail. Therefore, we apply the kernel embedding linear response to the linear response operator k_A(t) in (62) based on the Hermite polynomials. Let p̂_eq denote the kernel embedding estimate of p_eq in (61), and by Eq.
(41), the kernel embedding linear response operator, as an estimate of k_A(t) in (62), is defined accordingly. In the numerical experiment, we set (a, k_BT, γ) = (1, 1.5, 0.25). To generate the data from (59), we apply the weak trapezoidal method [3] with step size h = 1 × 10^{-3}, followed by a 1/5-subsample. Figure 2 shows the order-60 kernel embedding estimate of p_eq (we take β = 1 in (37) and M = 60 in (38)), compared with the true p_eq in (61). The estimate captures the well structure of the model with decent accuracy. In this figure, we also see that the error in estimating k_A decreases as a function of M. Figure 3 compares the linear response estimates in (62), obtained from the kernel embedding formulation and KDE, with the true linear response. Notice that the kernel embedding with M = 60 produces more accurate estimates compared to the KDE-based estimator.
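The data in the paper are generated with the weak trapezoidal method [3]; for a quick illustration of sampling a stochastic gradient system, the simpler Euler-Maruyama scheme below simulates dX = −∇V(X) dt + √(2 k_BT) dW, keeping only the quadratic retaining part of (60) (the bump wells are omitted, an assumption of ours that makes the equilibrium Gaussian and easy to check):

```python
import math, random

def grad_V(x1, x2, a=1.0):
    """Gradient of the quadratic retaining part 0.8[(x1-a)^2 + (x2-a/sqrt(3))^2]."""
    return 1.6 * (x1 - a), 1.6 * (x2 - a / math.sqrt(3.0))

def euler_maruyama(n_steps, dt=1e-3, kBT=1.5, seed=3):
    """Simulate dX = -grad V(X) dt + sqrt(2 kBT) dW with Euler-Maruyama."""
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    amp = math.sqrt(2.0 * kBT * dt)
    path = []
    for _ in range(n_steps):
        g1, g2 = grad_V(x1, x2)
        x1 += -g1 * dt + amp * rng.gauss(0.0, 1.0)
        x2 += -g2 * dt + amp * rng.gauss(0.0, 1.0)
        path.append((x1, x2))
    return path

# For this quadratic V, p_eq is Gaussian with mean (a, a/sqrt(3)) and variance kBT/1.6.
path = euler_maruyama(200000)
burn = path[20000:]                       # discard the transient
mean_x1 = sum(p[0] for p in burn) / len(burn)
```

With the full triple-well potential, one would only replace `grad_V` by the gradient of (60); the discretization error of such first-order schemes is one source of the equilibrium-density bias discussed below.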
Since the unperturbed dynamics (59) is a stochastic gradient system, the off-diagonal entries of the linear response operator k_A(t) in (62) should be zero at t = 0, due to the equipartition of energy in statistical mechanics [57]. However, the empirical estimates of k_A reported in Figure 3 show a slight deviation from the exact value. This error can be attributed to the fact that our data are generated from a numerical discretization, which introduces an error in the equilibrium density and the time correlation [27]. Nevertheless, it is quite remarkable that the kernel embedding estimate still captures the profile of the true off-diagonal component, considering that the true response is itself relatively small. Here, the difference between the KDE-based estimates and the true response is more pronounced.

5.2. A Langevin equation with Morse potential. For the second example, we consider a Langevin dynamics in statistical mechanics, describing the motion of a particle driven by a conservative force, a damping force, and a stochastic force. In particular, we choose the conservative force based on the Morse potential, where the last quadratic term in U_0, similar to the last term in the triple-well potential (60), acts as a retaining potential, preventing the particle from moving to infinity. For this one-dimensional model, we rescale the mass to unity and write the dynamics as in (64), where Ẇ_t is a white noise. The smooth retaining potential U(x) guarantees the ergodicity of the Langevin system in (64) (see Appendix C of [57] for the proof, which is an application of the result in [34]). Namely, there is an equilibrium distribution (also known as the Gibbs measure), p_eq(x, v), given in (65). To derive a linear response operator in (5), we introduce a constant external forcing together with a constant external velocity field; the corresponding perturbed system follows. By selecting the observable A = (x, v)ᵀ, the resulting linear response operator is given by (66). Here, to derive the 2-by-2 two-point statistics matrix in (66), we have used the fact that the variables x and v are independent at equilibrium. However, in our estimation problem, we do not assume that the underlying density p_eq is a product of two marginal densities. In particular, adopting the notation of Sections 3.1.1 and 3.1.2, we define the order-(M_1, M_2) kernel embedding estimate of p_eq in (65) as in (67), where the first factor denotes the normalized Laguerre polynomials with respect to the Gamma distribution G(x; 1) ∝ x e^{−x}, while {ψ_{m_2}} denotes the normalized Hermite polynomials with respect to the Gaussian distribution W. The representation in (67) is motivated by the asymmetric structure of the marginal distribution of x (see Figure 4).
To determine M_1 and M_2, consider the estimate η̂_M of the marginal distribution of x, which is a linear superposition of the orthogonal basis functions {ψ_{m_1}}. One can see that even with M_2 = 0 the error already converges, which is not surprising since v is Gaussian at equilibrium; in particular, using Hermite polynomials, we reach a perfect fit at order M_2 = 0. With M_2 = 0, as suggested by (67), the density estimation problem reduces to the one-dimensional problem of learning the marginal distribution of x. In the same plot, we show the error (in blue dashes) in the estimation of the marginal density of x, η_M, as a function of M = M_1 for a fixed M_2 = 0. Notice that the error decreases as a function of M. In the same panel, we also plot the rejection probability R_M in (53) as a function of M = M_1 for a fixed M_2 = 0. Notice that while the rejection probability fluctuates, it eventually becomes very small for M > 90. In the right panel, we compare the marginal distribution of x estimated using the Laguerre polynomials via the kernel embedding and via the KDE. A close inspection suggests that the kernel embedding produces a more accurate estimate. As for the Gaussian distributed variable v, both estimators are comparably accurate (results not shown).
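To make the marginal estimation step concrete, here is a minimal one-dimensional sketch of the kernel embedding estimator with a Laguerre basis. As labeled simplifications, it uses the standard Laguerre polynomials (orthonormal with respect to the weight e^{-x}, rather than the Gamma-weighted ones in (67)) and a synthetic i.i.d. Gamma sample in place of the time series. The coefficients are Monte-Carlo averages of the basis functions, and the density estimate is the truncated expansion multiplied by the weight:

```python
import numpy as np
from numpy.polynomial.laguerre import lagval

rng = np.random.default_rng(0)
# synthetic "data": Gamma(2, 1) samples, with density p(x) = x * exp(-x)
x = rng.gamma(shape=2.0, scale=1.0, size=200_000)

M = 5
# Standard Laguerre polynomials L_m are orthonormal w.r.t. the weight exp(-x)
# on [0, inf).  Kernel embedding coefficients: c_m = E[L_m(X)], estimated by
# a Monte-Carlo average over the samples.
c = np.array([lagval(x, np.eye(M + 1)[m]).mean() for m in range(M + 1)])

def p_hat(t):
    # truncated expansion times the weight: sum_m c_m L_m(t) exp(-t)
    return lagval(t, c) * np.exp(-t)

# For p(x) = x e^{-x} the exact coefficients are c_0 = 1, c_1 = E[1 - X] = -1,
# and c_m = 0 for m >= 2, so a low truncation order already recovers p.
```

Here p(x) = x e^{-x} = (L_0(x) − L_1(x)) e^{-x}, so the estimator should be nearly exact already at M = 1, mirroring how the Gaussian v in the example is captured at Hermite order 0.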
With the estimate p̂_eq of p_eq, we define the corresponding kernel embedding linear response k̂_A as the estimate of k_A(t) in (66), which does not rely on the independence of x and v at equilibrium.
In the numerical test, we set γ = 0.5, k_B T = 1.0, = 0.2, a = 10, and x_0 = 0. The data, in the form of a time series, are generated from the model (64) using an operator-splitting method [51] with step size h = 1 × 10^{-3}, followed by a 1/10 subsampling. Figure 4 presents the marginals of the kernel embedding estimate p̂_eq in (67) with M_1 = 90 and M_2 = 0. Notice that, since the domain of the Laguerre polynomials is the half-line, to approximate the coefficients via Monte-Carlo we have shifted the observations of x so that they are all positive. Figure 5 compares the linear response operator k_A in (66) with its estimates k̂_A in (68) based on the kernel embedding linear response and the KDE. We reach perfect fits in the (1, 2) and (2, 2) components of k_A since v is Gaussian at equilibrium. In our Langevin dynamics, the damping coefficient γ = 0.5 in (64) places the system in the under-damped regime [57], in which case we have a relatively strong memory effect and the two-point statistics k_A(t) in (66) exhibit a more complicated behavior. The numerical result suggests that the kernel embedding linear response estimate is more accurate compared to the KDE-based estimate. We report the elapsed wall-clock time of running the simulations in MATLAB on a desktop computer (equipped with a 3.2GHz quad-core Intel Core i5 processor with 32GB RAM) in Table 1. Notice that, since the complexity of the KDE method only depends on the sample size N = 10^7 and d = 2, the wall-clock times for the KDE estimates in both examples above are similar. With much fewer basis functions, thanks to the orthogonality, the kernel embedding approach is much faster. We point out that, in practice, there are many ways to reduce the computational cost of KDE; here, we followed the formulation in (58) for ease of implementation.
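The operator-splitting integrator of [51] is not reproduced in this paper's text; as a hedged illustration of the idea, the following is a generic BAOAB-type splitting scheme for a unit-mass Langevin equation, with a hypothetical quadratic stand-in for the Morse-plus-retaining potential (the actual U and parameters in (64) differ). The O-step applies the exact Ornstein-Uhlenbeck update, so the velocity marginal should equilibrate to N(0, k_B T):

```python
import numpy as np

gamma, kBT, h = 0.5, 1.0, 0.01  # illustrative values; the paper uses h = 1e-3
c1 = np.exp(-gamma * h)              # exact OU damping factor over one step
c2 = np.sqrt(kBT * (1.0 - c1**2))    # matching OU noise amplitude

def grad_U(x):
    # hypothetical stand-in potential U(x) = x^2 / 2 (NOT the Morse potential)
    return x

rng = np.random.default_rng(1)
x, v = 0.0, 0.0
vs = np.empty(500_000)
for n in range(vs.size):
    v -= 0.5 * h * grad_U(x)                    # B: half kick
    x += 0.5 * h * v                            # A: half drift
    v = c1 * v + c2 * rng.standard_normal()     # O: exact OU step
    x += 0.5 * h * v                            # A: half drift
    v -= 0.5 * h * grad_U(x)                    # B: half kick
    vs[n] = v
# at equilibrium the velocity marginal is N(0, kBT), so Var(v) should be near 1
```

In practice one would substitute the actual gradient of the Morse-plus-retaining potential and subsample the trajectory, as described above.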
From these two examples, we can see that the kernel embedding, implemented with appropriate orthogonal polynomials, is more efficient and produces more accurate estimates with a relatively simple tuning procedure, as proposed in (53)-(54).

Table 1. Elapsed time (based on a desktop computer, equipped with a 3.2GHz quad-core Intel Core i5 processor with 32GB RAM) of the KDE approach and the kernel embedding approach in computing the linear response statistics.

6. Summary and discussion. In this paper, we considered estimating the equilibrium density of unperturbed dynamics from non-i.i.d. observed time series. This problem arises from the estimation of linear response statistics. In particular, we employed a nonparametric density estimator formulated by the kernel embedding of distributions. We chose the corresponding hypothesis space (model) as an RKHS so that the properties of the kernel carry over to the functions in the RKHS.
To avoid the computational expense that arises when using radial-type kernels, we considered the "Mercer-type" kernels constructed from the classical orthogonal bases defined on non-compact domains, e.g., R^d. Here, the orthogonality corresponds to a weighted-L^2 space, which naturally provides the integral definition of the coefficients in the estimates. In practice, the number of basis functions involved, thanks to the orthogonality, is much smaller than the sample size. For example, in the test models, the sample size is of order 10^7, while the number of basis functions is of order 10^3 or less.
We used the orthogonal polynomials, weighted by a specific power of the corresponding weight function, to build a Mercer-type kernel corresponding to an RKHS. To overcome the difficulty caused by the non-compact domain, we showed that the boundedness of the kernel can be achieved either by using the asymptotic behavior of a class of orthogonal polynomials (e.g., Hermite polynomials) or by using existing identities (e.g., Laguerre polynomials). By studying the orthogonal polynomial approximation in the RKHS setting, we established the estimator's uniform convergence, which justifies using the estimator for interpolation. With this choice of basis representation, we arrived at the well-known polynomial chaos approximation; however, the convergence here is valid in both the L^2-norm and the uniform-norm sense. An important issue that we addressed is the practical problem of choosing the hypothesis space (or the polynomials), which is critical for accurate estimation. Using the RKHS formulation, we generalized the theory of c_0-universality to guarantee a consistent estimator with a hypothesis space constructed to respect the decay property of the target function, which can be empirically quantified from the available data.
In terms of linear response estimation, we defined the kernel embedding linear response based on the kernel embedding estimate of the equilibrium density. Our study provides practical conditions for the well-posedness of the estimator and of the underlying response statistics. Given a well-posed estimator, supported by the theory of RKHS, we provided a theoretical guarantee for the convergence of the estimator to the underlying true linear response statistics. Finally, since we approximate the coefficients in the estimates using a Monte-Carlo average, we derived a statistical error bound for the density estimation that accounts for the Monte-Carlo averaging over non-i.i.d. time series with the α-mixing property and for the biases due to the finite basis truncation. This error bound provides a means to understand the feasibility as well as the limitation of the kernel embedding with "Mercer-type" kernels (and, specifically, the polynomial chaos expansion).
Numerically, we validated the kernel embedding linear response estimator on two stochastic dynamics with known but non-trivial equilibrium densities. In the triple-well model, we explored the effectiveness of the kernel embedding estimate of a two-dimensional target density with Gaussian decay. In the Langevin model, the marginal distribution of the displacement, due to the Morse potential, is asymmetric (the decaying properties are different on two sides). We considered a hypothesis space (RKHS) based on a tensor product of Hermite (for the velocity) and Laguerre (for the displacement) polynomials in constructing the kernel. In both examples, we found that the proposed estimator is computationally more efficient and more accurate compared to the KDE-based linear response estimator.
Overall, the kernel embedding linear response provides a systematic and data-driven approach to computing the linear response statistics without knowing the explicit form of the underlying density function. However, this approach is still subject to the curse of dimensionality due to the usage of the tensor product.
that is, f^(n) → f^† in H_β. (iii) With all the preparation work we have done so far, the reproducing property (10) of H_β becomes almost trivial. For any fixed x_0 ∈ R^d, k_β(·, x_0) ∈ H_β by (19), which means that for every f ∈ H_β the inner product ⟨f, k_β(·, x_0)⟩ is well-defined for all x_0 ∈ R^d.
In particular, evaluating this inner product term by term yields ⟨f, k_β(·, x_0)⟩_{H_β} = f(x_0), which is the reproducing property.
Appendix C. Proof of Proposition 3. Here we discuss the proof of Proposition 3, which provides an error bound for the empirical kernel embedding estimate f_{M,N} in (40). To begin with, let us briefly review the concept of an α-mixing process and the corresponding Davydov covariance inequality [11]. Given a stationary process {X_n}_{n=1}^∞, let A_a^b (a < b ≤ ∞) denote the σ-algebra generated by {X_n, a ≤ n ≤ b}. We say {X_n} is α-mixing if

α(k) := sup_{n ≥ 1} sup_{A ∈ A_1^n, B ∈ A_{n+k}^∞} |P(A ∩ B) − P(A)P(B)| → 0, as k → ∞,

where α(k) is known as the α-mixing coefficient. This condition imposes a restriction on {X_n} amounting to a weak interdependence between the beginning and the end of the process. Davydov's covariance inequality states that

|cov(h(X_1), h(X_{1+k}))| ≤ 12 α(k)^{1/r} (E[|h(X_1)|^p])^{1/p} (E[|h(X_1)|^q])^{1/q},

where 1/r + 1/p + 1/q = 1, provided that E[|h(X_1)|^p] and E[|h(X_1)|^q] exist.
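As a rough numerical illustration (not part of the proof), a stationary Gaussian AR(1) process is a classical example of a geometrically α-mixing process, and the lag-k covariance of an observable with bounded moments decays geometrically in k, consistent with the role of α(k)^{1/r} in Davydov's bound. A sketch with the observable h(x) = x:

```python
import numpy as np

phi, n = 0.8, 400_000
rng = np.random.default_rng(2)

# Stationary AR(1): X_{t+1} = phi X_t + sqrt(1 - phi^2) Z_t, so Var(X_t) = 1
# and cov(X_1, X_{1+k}) = phi^k, a geometric decay in the lag k.
x = np.empty(n)
x[0] = rng.standard_normal()
z = rng.standard_normal(n - 1) * np.sqrt(1.0 - phi**2)
for t in range(n - 1):
    x[t + 1] = phi * x[t] + z[t]

def lag_cov(x, k):
    # empirical lag-k covariance of the stationary series
    return np.cov(x[:-k], x[k:])[0, 1]
```

The empirical lag covariances track phi^k, so summing them over k (as in the Monte-Carlo variance of a time average) gives a finite contribution, which is exactly what the α-mixing assumption buys in the error bound.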
To bound the covariance term in (79), we apply Davydov's covariance inequality (77) with h = h_m. Setting p = q = 2 + ε in (77) for ε ∈ (0, 2), so that r = (2 + ε)/ε, and using Jensen's inequality in the last step, we obtain

|cov(h_m(X_1), h_m(X_{1+k}))| ≤ 12 α(k)^{ε/(2+ε)} (E[|h_m(X_1)|^{2+ε}])^{2/(2+ε)}.
As a result, the summation in Eq. (83) can be controlled accordingly. Finally, notice that for a positive integer j, the total number of distinct d-dimensional multi-indices m with ‖m‖_1 = j is \binom{j+d-1}{d-1}, and

\binom{j+d-1}{d-1} = \frac{(j+1)(j+2)\cdots(j+d-1)}{(d-1)!}.
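The multi-index count used above can be checked directly by enumeration; here is a small sketch comparing a brute-force count of the d-dimensional multi-indices with ‖m‖_1 = j against the binomial formula:

```python
from itertools import product
from math import comb

def count_multi_indices(d, j):
    # brute force: number of m in N_0^d with m_1 + ... + m_d = j
    return sum(1 for m in product(range(j + 1), repeat=d) if sum(m) == j)

# The count equals comb(j + d - 1, d - 1), i.e. the "stars and bars" formula
# (j+1)(j+2)...(j+d-1) / (d-1)!, which grows polynomially in j of degree d-1.
```

This polynomial growth in j is what makes the tail of the summation over basis orders controllable, and it is also the combinatorial source of the curse of dimensionality noted in the discussion.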