Randomized learning of the second-moment matrix of a smooth function

Consider an open set $\mathbb{D}\subseteq\mathbb{R}^n$, equipped with a probability measure $\mu$. An important characteristic of a smooth function $f:\mathbb{D}\rightarrow\mathbb{R}$ is its second-moment matrix $\Sigma_{\mu} := \int \nabla f(x) \nabla f(x)^* \mu(dx) \in\mathbb{R}^{n\times n}$, where $\nabla f(x)\in\mathbb{R}^n$ is the gradient of $f(\cdot)$ at $x\in\mathbb{D}$ and $*$ stands for transpose. For instance, the span of the leading $r$ eigenvectors of $\Sigma_{\mu}$ forms an active subspace of $f(\cdot)$, which contains the directions along which $f(\cdot)$ changes the most and is of particular interest in ridge approximation. In this work, we propose a simple algorithm for estimating $\Sigma_{\mu}$ from random point evaluations of $f(\cdot)$ without imposing any structural assumptions on $\Sigma_{\mu}$. Theoretical guarantees for this algorithm are established with the aid of the same technical tools that have proved valuable in the context of covariance matrix estimation from partial measurements.


1. Introduction. Central to approximation theory, machine learning, and computational sciences in general is the task of learning a function given finitely many of its point samples. More concretely, consider an open set D ⊆ R^n, equipped with a probability measure µ. An important characteristic of a smooth function f : D → R is its second-moment matrix

Σ_µ := ∫ ∇f(x) ∇f(x)^* µ(dx) ∈ R^{n×n},    (1)

where ∇f(x) ∈ R^n is the gradient of f(·) at x ∈ D and the superscript * denotes vector and matrix transpose.¹ The [i, j]th entry of this matrix, namely Σ_µ[i, j], measures the expected product between the ith and jth partial derivatives of f(·). Note that Σ_µ captures key information about how f(·) changes along different directions. Indeed, for an arbitrary vector v ∈ R^n with ‖v‖_2 = 1, the directional derivative of f(·) at x ∈ D along v is v^* ∇f(x), and it is easy to check that this directional derivative, itself a scalar function on D, has average energy v^* Σ_µ v with respect to the measure µ. The directions with the most energy, that is, the directions along which f(·) changes the most on average, are particularly important in ridge approximation, where we are interested in approximating the (possibly complicated) function f(·) with a (simpler) ridge function. More specifically, the leading r eigenvectors of Σ_µ span an r-dimensional active subspace of f(·) with respect to the measure µ [11], which contains the directions along which f(·) changes the most. If U_{µ,r} ∈ R^{n×r} denotes an orthonormal basis for this active subspace, then it might be possible to reliably approximate f(x) with h(U_{µ,r}^* x) for all x ∈ D and for some smooth function h : R^r → R. In this sense, we might think of ridge approximation and active subspaces as extensions of, respectively, dimensionality reduction and principal component analysis to high-dimensional functions. Beyond approximation theory, the significance of second-moment matrices (and related concepts) across a number of other disciplines is discussed in Section 4.
With this introduction, the main objective of this paper is the following, which will be made precise later in Section 2.
Objective: Design query points {x_i}_{i=1}^N and learn from {x_i, f(x_i)}_{i=1}^N the second-moment matrix of f(·) with respect to the measure µ. We must emphasize that we impose no structural assumptions on the second-moment matrix (such as being low rank or sparse), a point that we shall revisit later in Section 4. Our approach to this problem, alongside the results, is summarized next with minimal details for better accessibility. A rigorous account of the problem and our approach is then presented in Sections 2 and 3.
1.1. Approach. We assume in this paper that points in the domain D are observed randomly according to the probability measure µ. In particular, consider N random points drawn independently from µ and stored as the columns of a matrix X ∈ R^{n×N}. It is then easy to verify [10] that

Σ̇_X := (1/N) Σ_{x∈X} ∇f(x) ∇f(x)^* ∈ R^{n×n}    (2)

is an unbiased estimator of Σ_µ in (1).² In fact, a standard large deviation analysis reveals that ‖Σ̇_X − Σ_µ‖ ∝ 1/√N, with overwhelming probability and for any matrix norm ‖·‖.
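When gradients are available, the estimator Σ̇_X in (2) is straightforward to compute. The following sketch (with hypothetical helper names, not from the paper) illustrates it; for a linear function the gradient is constant, so the estimator recovers Σ_µ exactly.

```python
import numpy as np

def second_moment_from_gradients(grad_f, X):
    """Plug-in estimate (1/N) * sum_i g_i g_i^T with g_i = grad f(x_i).

    X is n-by-N with columns drawn i.i.d. from mu; grad_f maps a point in R^n
    to its gradient in R^n. This is the estimator that requires exact gradients."""
    n, N = X.shape
    Sigma = np.zeros((n, n))
    for i in range(N):
        g = np.asarray(grad_f(X[:, i]), dtype=float)
        Sigma += np.outer(g, g)
    return Sigma / N

# For f(x) = b^T x the gradient is the constant vector b, so the estimator
# equals Sigma_mu = b b^T exactly, regardless of the sample.
rng = np.random.default_rng(0)
b = rng.standard_normal(5)
X = rng.uniform(-1.0, 1.0, size=(5, 1000))
Sigma = second_moment_from_gradients(lambda x: b, X)
```

For nonlinear f the estimate fluctuates around Σ_µ at the 1/√N rate noted above.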
Since we furthermore assume that only the point values of f(·) are at our disposal (rather than its gradients), it is not possible to directly calculate Σ̇_X as in (2). Thus, one might resort to using finite difference approximations of the partial derivatives, as we sketch here and formalize in Section 2. Our procedure for estimating the second-moment matrix of f(·) will in fact rely not only on {x_i, f(x_i)}_{i=1}^N but also on a supplementary set of points (also drawn randomly) nearby those in X. In particular, for a sufficiently small ε > 0 and arbitrary x, let B_{x,ε} denote the Euclidean ball of radius ε about x, and set B_{X,ε} := ∪_{x∈X} B_{x,ε}. Let also µ_{X,ε} be the conditional probability measure on B_{X,ε} induced by µ. Consider N_{X,ε} random points drawn independently from µ_{X,ε} and stored as the columns of Y_{X,ε} ∈ R^{n×N_{X,ε}}. Then partition Y_{X,ε} according to X by setting Y_{x,ε} = Y_{X,ε} ∩ B_{x,ε}, so that Y_{x,ε} ∈ R^{n×N_{x,ε}} contains all ε-neighbors of x in Y_{X,ε}. This setup is illustrated in Figure 1.
For every x ∈ X, consider ∇̂_{Y_{x,ε}} f(x) ∈ R^n as an estimate of the true gradient ∇f(x), where ∇̂_{Y_{x,ε}} f(x) is formed from finite differences of f over the ε-neighbors Y_{x,ε} of x; the scaling with n that appears in its definition will be shortly justified. Then we could naturally consider Σ̇_{X,Y_{X,ε}} ∈ R^{n×n} as an estimate of Σ̇_X in (2), and in turn an estimate of Σ_µ in (1), where

Σ̇_{X,Y_{X,ε}} := (1/N) Σ_{x∈X} ∇̂_{Y_{x,ε}} f(x) ∇̂_{Y_{x,ε}} f(x)^*.    (4)

In general, however, Σ̇_{X,Y_{X,ε}} is a biased estimator of Σ̇_X, and the importance of consistency in statistical learning motivated us to search for a better estimator, not unlike the approach taken in covariance matrix estimation [42, 21]; see also Section 1.2. Algorithm 1, in fact, introduces an estimate of Σ̇_X, denoted throughout by Σ̂_{X,Y_{X,ε}}, which has a smaller bias than Σ̇_{X,Y_{X,ε}}. Indeed, Theorem 3.1 in Section 3, roughly speaking,³ establishes that

E ‖Σ̂_{X,Y_{X,ε}} − Σ̇_X‖_F ≲ B_{µ,ε} + ε n²,    (5)

where the expectation is over X and Y_{X,ε}, and ‖·‖_F stands for the Frobenius norm. Throughout, we will use ≲ and similarly ≳, ≈ to suppress universal constants and simplify the presentation. Above, the quantity B_{µ,ε} depends in a certain way on the regularity of the measure µ and the function f(·), with the dependence on f(·) suppressed as usual. Moreover, loosely speaking, it holds true that

‖Σ̂_{X,Y_{X,ε}} − Σ_µ‖_F ≲ B_{µ,ε} + ε n² + n/√N_{X,ε},    (6)

with high probability, as described in Theorem 3.2 in Section 3.

Figure 1. Visualization of the problem setup. The probability measure µ is supported on the domain D ⊆ R^n of a smooth function f(·). Here, N = 3 and X = {x_i}_{i=1}^N are drawn independently from µ. For sufficiently small ε, we let B_{x_i,ε} denote the ε-neighborhood of each x_i and set B_{X,ε} = ∪_{i=1}^N B_{x_i,ε}. On B_{X,ε}, µ induces the conditional measure µ_{X,ε}, from which N_{X,ε} points are independently drawn and collected in Y_{X,ε} = {y_{ij}}_{i,j}. Here, x_1 has N_{x_1,ε} = 2 neighbors in Y_{X,ε} and we set Y_{x_1,ε} = {y_{1,j}}_{j=1}^{N_{x_1,ε}}. Similarly, Y_{x_2,ε} and Y_{x_3,ε} are formed. Note that Y_{X,ε} = ∪_{i=1}^N Y_{x_i,ε}. Our objective is to estimate the second-moment matrix of f(·) (with respect to the probability measure µ) given {x_i, f(x_i)} and {y_{ij}, f(y_{ij})}.
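The paper's exact definition of ∇̂_{Y_{x,ε}} f(x) is the one given in Algorithm 1; what follows is only a plausible sketch of such a rescaled finite-difference estimator, using the fact that for y uniform in B_{x,ε} one has E[(y − x)(y − x)^*] = ε²/(n + 2) I_n. The helper names and the exact constant are assumptions, not the paper's.

```python
import numpy as np

def sample_ball(rng, x, eps, N):
    """Draw N points uniformly from the Euclidean ball B(x, eps)."""
    n = x.size
    d = rng.standard_normal((n, N))
    d /= np.linalg.norm(d, axis=0)          # uniform direction on the sphere
    r = eps * rng.random(N) ** (1.0 / n)    # radius density proportional to r^(n-1)
    return x[:, None] + d * r

def grad_estimate(f, x, Y, eps):
    """Gradient estimate at x from eps-neighbors stored as columns of Y.

    For y uniform in B(x, eps), E[(y - x)(y - x)^T] = eps^2/(n + 2) * I_n, so
    averaging (f(y) - f(x)) (y - x) over the neighbors and rescaling by
    (n + 2)/eps^2 is approximately unbiased for grad f(x); for a general
    measure mu the constant would differ."""
    n, Nx = Y.shape
    diffs = Y - x[:, None]
    vals = np.array([f(Y[:, j]) - f(x) for j in range(Nx)])
    return (n + 2) / (eps ** 2 * Nx) * (diffs @ vals)

# Sanity check on a linear function, whose gradient is the constant vector b.
rng = np.random.default_rng(0)
n, eps = 4, 1e-3
b = np.array([1.0, -2.0, 0.5, 3.0])
x = np.zeros(n)
Y = sample_ball(rng, x, eps, 20_000)
g = grad_estimate(lambda z: b @ z, x, Y, eps)
```

Note the scaling with n in the estimator, mirroring the scaling mentioned in the text: each random neighbor only probes one direction, and the factor of order n compensates for the averaging over directions.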
As a rule of thumb, (5) and (6) hold when N_{X,ε} ≈ N log² N. Thus, it suffices to take only O(log² N) samples within the ε-neighborhood of each of the N data points in X. As we discuss in Remark 5, the resulting convergence rate (as a function of the total number of samples) is then nearly the same as one achieves when perfect knowledge of the gradients is available and two-stage sampling is not required.
To verify the convergence rate from (6), we consider the following numerical example. For x ∈ [−1, 1]^{500} (i.e., n = 500) and µ the uniform probability distribution on the hypercube, let f(x) = (1/2) x^* A x + b^* x for a known symmetric matrix A ∈ R^{500×500} and a known vector b ∈ R^{500}. A quick calculation shows that Σ_µ = (1/3) A² + b b^*. For the simulation, we generate A and b randomly, and we estimate the relative error (in the Frobenius norm) in the approximation Σ̂_{X,Y_{X,ε}} from Algorithm 1. All reported results use ε = 10^{−4}; using ε = 10^{−2} and ε = 10^{−6} produced similar results. Each subfigure in Figure 2 shows results using a different value for N_{X,min,ε} from the set {50, 200, 400, 550}, i.e., the minimum number of samples in each ball. Note that the first three values of N_{X,min,ε} tested are significantly less than the dimension n of the space. In other words, each gradient approximation uses significantly fewer than n samples.

Algorithm 1:
• Draw N_{X,ε} random points independently from µ_{X,ε} and store them as the columns of Y_{X,ε} ∈ R^{n×N_{X,ε}}. Here, µ_{X,ε} is the conditional probability measure induced by µ on B_{X,ε} = ∪_{x∈X} B_{x,ε}. In turn, B_{x,ε} ⊂ R^n is the Euclidean ball of radius ε about x. Partition Y_{X,ε} according to X by setting Y_{x,ε} = Y_{X,ε} ∩ B_{x,ε}, so that Y_{x,ε} ∈ R^{n×N_{x,ε}} contains all ε-neighbors of x in Y_{X,ε}.
• Compute and return Σ̂_{X,Y_{X,ε}} as defined in (7), where I_n denotes the n × n identity matrix and Σ̇_{X,Y_{X,ε}} is the plug-in estimator in (4).

1.2. Contribution and organization. The main contribution of this paper is the design and analysis of a simple algorithm to estimate the second-moment matrix Σ_µ of a smooth function f(·) from its point samples; see (1) and Algorithm 1. As argued earlier and also in Section 4, Σ_µ is a key quantity in ridge approximation and a number of related problems.
The key distinction of this work is the lack of any structural assumptions (such as small rank or sparsity) on Σ_µ; mild assumptions on f are specified at the beginning of Section 2. Imposing a specific structure on Σ_µ can lead to more efficient algorithms, as we discuss in Section 4.

Figure 2. A simulation study using a 500-dimensional (n = 500) quadratic function for the relative error in the approximation (7) as a function of the total number N_{X,ε} of samples. All simulations use ε = 10^{−4}. Each subfigure uses a different value for the minimum number N_{X,min,ε} of points in the ε-neighborhood of each center point in X, and varies the number N of centers. Each of the five groups of blue dots in each subfigure indicates a different number N. Within each group, there are ten replications. The slope of each line is −1/2, which verifies the expected convergence rate from (6).
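The closed form Σ_µ = (1/3)A² + bb^* used in the simulation above follows from ∇f(x) = Ax + b together with E[x] = 0 and E[xx^*] = I/3 on the uniform hypercube. A small-n numerical check (illustrative only; the paper's experiment uses n = 500 and Algorithm 1 rather than exact gradients):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                   # small stand-in for the paper's n = 500
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                       # symmetric, as in the experiment
b = rng.standard_normal(n)

# grad f(x) = A x + b, and for x uniform on [-1, 1]^n we have E[x] = 0 and
# E[x x^T] = I/3, so Sigma_mu = A (I/3) A + b b^T = A^2/3 + b b^T.
Sigma_exact = A @ A / 3 + np.outer(b, b)

# Monte Carlo estimate of Sigma_mu from exact gradients at uniform samples.
X = rng.uniform(-1.0, 1.0, size=(n, 200_000))
G = A @ X + b[:, None]
Sigma_mc = G @ G.T / X.shape[1]

rel_err = np.linalg.norm(Sigma_mc - Sigma_exact) / np.linalg.norm(Sigma_exact)
```

The relative error shrinks at the 1/√N rate, consistent with the slope of −1/2 seen in Figure 2.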
At a very high level, there is indeed a parallel between estimating the second-moment matrix of a function from random point samples and estimating the covariance matrix of a random vector; Algorithm 1 in a sense produces an analogue of the sample covariance matrix, adjusted to handle missing data [42]. In this context, more efficient algorithms are available for estimating, for example, a covariance matrix with a sparse inverse [21]. In this sense, we feel that this work fills an important gap in the literature of ridge approximation and perhaps dimensionality reduction by addressing the problem in more generality.
The rest of this paper is organized as follows. The problem of learning the second-moment matrix of a function is formalized in Section 2. Our approach to this problem, stated more formally, along with the theoretical guarantees, is described in Section 3. In Section 4, we sift through a large body of literature and summarize the relevant prior work. Proofs and technical details are deferred to Section 5 and the appendices.
2. Problem statement and approach. In this section, we formalize the problem outlined in Section 1. Consider an open set D ⊆ R^n, equipped with the subspace Borel σ-algebra and a probability measure µ. We assume throughout that f : D → R is twice differentiable on D, and that

sup_{x∈D} ‖∇f(x)‖_2 ≤ L_f,    (9)
sup_{x∈D} ‖∇²f(x)‖_2 ≤ H_f,    (10)

where ∇f(x) ∈ R^n and ∇²f(x) ∈ R^{n×n} are the gradient and Hessian of f(·) at x ∈ D, respectively, and we use the notation ‖·‖_2 to denote both the ℓ_2-norm of vectors and the spectral norm of matrices. Moreover, for ε > 0, let D_ε ⊂ D denote the ε-interior of D, namely D_ε = {x ∈ D : B_{x,ε} ⊆ D}. Throughout, B_{x,ε} ⊂ R^n denotes the (open) Euclidean ball of radius ε centered at x. Consider Σ_µ ∈ R^{n×n} defined as

Σ_µ := E_x [∇f(x) ∇f(x)^*],    (11)

where E_x computes the expectation with respect to x ∼ µ. Our objective in this work is to estimate Σ_µ. To that end, consider N random points drawn independently from µ and stored as the columns of X ∈ R^{n×N}. Then, as noted in Section 1.1, it is easy to verify that

Σ̇_X := (1/N) Σ_{x∈X} ∇f(x) ∇f(x)^*    (12)

is an unbiased estimator for Σ_µ in (11). To interpret (12), recall also that we treat matrices and sets interchangeably throughout, slightly abusing the standard notation. In particular, x ∈ X can also be interpreted as x being a column of X ∈ R^{n×N}. The following result quantifies how well Σ̇_X approximates Σ_µ. Its proof is included in Appendix B for completeness; see [10] for related results concerning the accuracy of Σ̇_X as an estimate of Σ_µ.

Proposition 1. Let X ∈ R^{n×N} contain N independent samples drawn from the probability measure µ. Then, Σ̇_X is an unbiased estimator for Σ_µ ∈ R^{n×n}; see (11) and (12). Moreover, except with a probability of at most n^{−1}, it holds that

‖Σ̇_X − Σ_µ‖_F ≲ L_f² √(log(n)/N).    (13)

Since only point values of f(·) are at our disposal, we cannot compute Σ̇_X directly. Instead, we will systematically generate random points near the point cloud X and then estimate Σ̇_X by aggregating local information, as detailed next.
Given the point cloud X ⊂ D, fix ε > 0 small enough that X is a 2ε-separated point cloud that belongs to the ε-interior of D. Formally, fix ε ≤ ε_X, where

ε_X := min{ sup{ε' > 0 : X ⊂ D_{ε'}}, (1/2) min_{x ≠ x' ∈ X} ‖x − x'‖_2 },    (14)

and let B_{X,ε} := ∪_{x∈X} B_{x,ε} denote the ε-neighborhood of the point cloud X. Consider the conditional probability measure µ_{X,ε} on B_{X,ε}, described as

µ_{X,ε}(A) := µ(A)/µ(B_{X,ε}), for every Borel set A ⊆ B_{X,ε}.    (16)

For an integer N_{X,ε}, draw N_{X,ε} independent random points from µ_{X,ε} and store them as the columns of Y_{X,ε} ∈ R^{n×N_{X,ε}}. Finally, an estimate of Σ̇_X, and in turn of Σ_µ, as a function of X, Y_{X,ε} ⊂ D and evaluations of f(·) at these points, is proposed as Σ̂_{X,Y_{X,ε}} in Algorithm 1.
3. Theoretical guarantees. Recalling (11) and (12), how well does Σ̂_{X,Y_{X,ε}} in Algorithm 1 approximate Σ̇_X, and in turn Σ_µ? Parsing the answer requires introducing additional notation and imposing a certain regularity assumption on µ. We set all this out now, before stating the results in Section 3.2.
For each x ∈ X, let the columns of Y_{x,ε} ∈ R^{n×N_{x,ε}} contain the ε-neighbors of x in Y_{X,ε}. In our notation, this can be written as

Y_{x,ε} = Y_{X,ε} ∩ B_{x,ε}.    (17)

Because ε ≤ ε_X is small, see (14), these neighborhoods do not intersect, that is,

B_{x,ε} ∩ B_{x',ε} = ∅ for distinct x, x' ∈ X;

therefore, Y_{X,ε} is simply partitioned into #X = N subsets {Y_{x,ε}}_{x∈X}. Observe also that, conditioned on x ∈ X and N_{x,ε}, each neighbor y ∈ Y_{x,ε} follows the conditional probability measure

µ_{x,ε}(A) := µ(A)/µ(B_{x,ε}), for every Borel set A ⊆ B_{x,ε}.    (18)

3.1. Regularity of µ. In order to introduce the regularity condition imposed on µ, consider first the special case where the domain D ⊂ R^n is bounded and µ is the uniform probability measure on D. Then, for ε > 0 and an arbitrary ε-interior point x ∈ D_ε, the conditional measure µ_{x,ε} too is the uniform measure on B_{x,ε}; see (18). Draw y from µ_{x,ε}, that is, y|x ∼ µ_{x,ε} in our notation. Then it is easy to verify that y − x is an isotropic random vector, namely

E_{y|x} [(y − x)(y − x)^*] = C ε² I_n,    (19)

for some factor C.⁴ Above, I_n ∈ R^{n×n} is the identity matrix and E_{y|x}[·] = E_y[·|x] stands for the conditional expectation, given x. A similar property plays an important role in this paper, as captured by Assumption 1 below.
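For the uniform measure on a ball, the factor C in the isotropy identity above works out to C = 1/(n + 2) (a standard computation, stated here as an assumption of this sketch), and the identity is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, N = 3, 0.5, 200_000

# Draw y - x uniformly from the ball of radius eps: a uniform direction on the
# sphere times a radius with density proportional to r^(n-1).
d = rng.standard_normal((n, N))
d /= np.linalg.norm(d, axis=0)
r = eps * rng.random(N) ** (1.0 / n)
D = d * r                               # columns are draws of y - x

# Empirical second moment; for the uniform ball it approaches eps^2/(n+2) * I_n.
M = D @ D.T / N
C = eps ** 2 / (n + 2)
```

The off-diagonal entries of M vanish on average and the diagonal entries concentrate around ε²/(n + 2), which is the isotropy that Assumption 1 below requires to hold approximately for general µ.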
Assumption 1. (Local near-isotropy of µ) Throughout this paper, we assume that there exist ε_µ, K_µ > 0 such that, for all ε ≤ ε_µ, the following requirement holds for an arbitrary ε-interior point x ∈ D_ε.

Given x, draw y from the conditional measure on the ε-neighborhood of x, namely y|x ∼ µ_{x,ε} with µ_{x,ε} defined in (18). Then, for every γ ≥ 0 and arbitrary (but fixed) v ∈ R^n, it holds that

Pr_{y|x} [ √n ‖P_{x,y} v‖_2 ≥ γ ‖v‖_2 ] ≤ e^{−K_µ γ²},

where

P_{x,y} := (y − x)(y − x)^* / ‖y − x‖_2²

is the orthogonal projection onto the direction of y − x. Above, Pr_{y|x}[·] = Pr_y[·|x] stands for conditional probability.
Roughly speaking, under Assumption 1, µ is locally isotropic. Indeed, this assumption is met when µ is the uniform probability measure on D, as shown in Appendix C. Moreover, Assumption 1 is not too restrictive. One would expect that a probability measure µ, if dominated by the uniform measure on D and with a smooth Radon-Nikodym derivative, satisfies Assumption 1 when restricted to sufficiently small neighborhoods.
Assumption 1 also controls the growth of the moments of ‖P_{x,y} v‖_2 [58, Lemma 5.5]. Finally, given the point cloud X, we also conveniently set

ε_{µ,X} := min(ε_µ, ε_X);    (20)

see Assumption 1 and (14). We are now in a position to present the main results.

3.2. Performance of Algorithm 1. With the setup detailed in Section 2, we now quantify the performance of Algorithm 1. In Theorems 3.1 and 3.2 below, for a fixed point cloud X, we focus on how well the output of Algorithm 1, namely Σ̂_{X,Y_{X,ε}}, approximates Σ̇_X. Then, in the ensuing remarks, we remove the conditioning on X, using Proposition 1 to see how well Σ̂_{X,Y_{X,ε}} approximates Σ_µ. We now turn to the details. Theorem 3.1 below states that Σ̂_{X,Y_{X,ε}} can be a nearly unbiased estimator of Σ̇_X given X; see (12). The proof is given in Section 5.1. Throughout, E_{z_1|z_2}[·] = E_{z_1}[·|z_2] stands for the conditional expectation over z_1, conditioned on z_2, for random variables z_1, z_2.

Theorem 3.1. (Bias) Consider an open set D ⊆ R^n equipped with a probability measure µ satisfying Assumption 1, and consider a twice differentiable function f : D → R satisfying (9) and (10). Assume that the columns of a (fixed) matrix X ∈ R^{n×N} belong to D_ε, namely X ⊂ D_ε in our notation. Fix also ε ∈ (0, ε_{µ,X}]; see (20). For an integer N and integers N_{X,ε} ≥ N and N_{X,min,ε} ≤ N_{X,ε}, assume also that N_{X,ε} and N_{X,min,ε} satisfy the conditions in (21), where ρ_{µ,X,ε} is the non-uniformity parameter defined in (22). Then the output of Algorithm 1, namely the estimator Σ̂_{X,Y_{X,ε}} defined in (7), satisfies the bias bound (23), where B_{µ,ε} is given explicitly in (44).
In this theorem statement and throughout the paper, we use the notation log^a b as shorthand for (log b)^a. A few remarks are in order.
Remark 1. (Discussion) Theorem 3.1 describes how well Σ̂_{X,Y_{X,ε}} approximates Σ̇_X, in expectation. To form a better understanding of this result, let us first study the conditions listed in (21).
• The quantity ρ_{µ,X,ε}, defined in (22), reflects the non-uniformity of µ over the set D. In particular, if D ⊂ R^n is bounded and µ is the uniform probability measure on D, then ρ_{µ,X,ε} achieves its maximum possible value of 1. Non-uniform measures could yield ρ_{µ,X,ε} < 1.
• The requirements on N_{X,ε} and N_{X,min,ε} in (21) ensure that every x ∈ X has sufficiently many neighbors in Y_{X,ε}, that is, N_{x,ε} is large enough for every x. For example, if µ is the uniform probability measure on D and ρ_{µ,X,ε} = 1, we might take N_{X,min,ε} ≈ log² N so that (23) holds with a total of N_{X,ε} = O(N log² N) samples.
• The requirement that N_{X,ε} ≥ n^{1/20} in (21) is very mild and will be automatically satisfied in cases of interest, as we discuss below.
Let us next interpret the bound on the bias in (23).
• The first term on the right-hand side of (23), namely B_{µ,ε}, is given explicitly in (44); it depends on both the probability measure µ and the function f(·), and it can also be viewed as a measure of the non-uniformity of µ. In fact, as explained in the proof of Theorem 3.1, in the special case where µ is the uniform probability measure on a bounded and open set D and every x ∈ X has the same number of neighbors N_{x,ε} = N_{X,ε}/N within Y_{X,ε}, then, conditioned on this event, (23) can in fact be sharpened by replacing the definition of B_{µ,ε} in (44) simply with B_{µ,ε} = 0. In general, the more isotropic µ is, in the sense described in Assumption 1, the smaller B_{µ,ε} will be.
• The second term on the right-hand side of (23) is negligible, as we will generally have N growing at least with n², as explained below.
• The third and fourth terms on the right-hand side of (23) can be made arbitrarily small by choosing the neighborhood radius ε appropriately small (as a function of L_f, H_f, n, K_µ, and N_{X,ε}). In computational applications, however, choosing ε too small could raise concerns about numerical precision.
• To get a sense of when the bias in (23) is small relative to the size of Σ_µ, it may be appropriate to normalize (23). A reasonable choice would be to divide both sides of (23) by L_f², where L_f bounds ‖∇f(x)‖_2 on D; see (9). In particular, such a normalization accounts for the possible scaling behavior of ‖Σ_µ‖_F if one were to consider a sequence of problems with n increasing. For example, in the case where n increases but the new variables in the domain of f(·) do not affect its value, L_f² and ‖Σ_µ‖_F are both constant. On the other hand, in the case where n increases and f(·) depends uniformly on the new variables, L_f² and ‖Σ_µ‖_F both increase with n. In any case, one can show that ‖Σ_µ‖_F ≤ L_f².
With this choice of normalization, the second, third, and fourth terms on the right-hand side of (23) can still be made arbitrarily small, as described above. In the special case where µ is uniform on D and every x ∈ X has the same number of neighbors N_{x,ε} = N_{X,ε}/N, the first term on the right-hand side of (23) remains zero, as also described above. More generally, however, B_{µ,ε}/L_f² will contain a term that scales like √n / N_{X,min,ε}, and to control this term it is necessary to choose N_{X,min,ε} ≳ n^{1/2} log² N, so that (21) is also satisfied. Notably, though, this method can be implemented when fewer than n neighbors are available for each x ∈ X, whereas estimating the local gradients via a conventional finite difference approximation would require n neighbors per point using deterministic queries. For Algorithm 1, we revisit the impact of n on the choices of N and N_{X,ε} after presenting Theorem 3.2 below.
Remark 2. (Sampling strategy) In Algorithm 1, N_{X,ε} points are independently drawn from the conditional probability measure µ_{X,ε} on the ε-neighborhood of the point cloud X, and then stored as the columns of Y_{X,ε}; see (16) and (24). This sampling strategy appears to best fit our fixed budget of N_{X,ε} samples, as it "prioritizes" the areas of D with larger "mass." For example, suppose that µ(dx) ≫ µ(dx′), suggesting that a larger weight should be placed on x rather than x′ when estimating Σ_µ; see (11). In the same scenario, assume naturally that µ(B_{x,ε}) ≫ µ(B_{x′,ε}), so that it is more likely to sample from the neighborhood of x than from that of x′. Then, given a fixed budget of N_{X,ε} samples, it is highly likely that N_{x,ε} ≫ N_{x′,ε}; that is, x likely has far more ε-neighbors in Y_{X,ε} than x′. Loosely speaking, then, the contribution of x to Σ̂_{X,Y_{X,ε}} is calculated more accurately than that of x′. In other words, the sampling strategy used in Algorithm 1 indeed assigns more weight to areas of D with larger mass.
In some applications, however, sampling points according to the distribution µ_{X,ε} may be a challenge. A rejection sampling strategy, where points are drawn i.i.d. from µ on D and those falling outside ∪_{x∈X} B_{x,ε} are discarded, is one possibility but is not feasible in high dimensions. As an alternative, one can consider a two-phase approach where first a ball B_{x,ε} with x ∈ X is selected with probability proportional to µ(B_{x,ε}), and second a point is selected from the uniform measure within this ball. Such locally uniform sampling is an approximation to sampling from the distribution µ_{X,ε}. We expect that similar performance bounds hold for this locally uniform sampling strategy, especially when ε is small, but we do not quantify this here.
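A minimal sketch of the two-phase strategy just described, assuming the ball weights µ(B_{x_i,ε}) are available or approximated and passed in as `weights` (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def two_phase_sample(rng, X, eps, weights, N):
    """Two-phase approximation to sampling from mu_{X,eps}.

    Phase 1: pick a center x_i with probability proportional to weights[i],
    a stand-in for mu(B_{x_i,eps}). Phase 2: draw uniformly from B(x_i, eps).
    Returns the n-by-N sample matrix and the chosen center indices."""
    n, num_centers = X.shape
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    idx = rng.choice(num_centers, size=N, p=p)
    d = rng.standard_normal((n, N))
    d /= np.linalg.norm(d, axis=0)          # uniform directions on the sphere
    r = eps * rng.random(N) ** (1.0 / n)    # radii for uniform-in-ball draws
    return X[:, idx] + d * r, idx

# Two well-separated centers in R^2 with equal weights.
rng = np.random.default_rng(0)
X = np.array([[0.0, 10.0], [0.0, 10.0]])
Y, idx = two_phase_sample(rng, X, 0.1, [0.5, 0.5], 500)
```

Unlike rejection sampling, every draw here is accepted, which is what makes the approach practical in high dimensions.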
Remark 3. (Proof strategy) At a high level, the analysis handles the possible non-uniformity of the measure µ and the higher-order terms in f(·) by introducing quantities that are simpler to work with but are similar to Σ̂_{X,Y_{X,ε}}. Moreover, if N_{X,ε} is sufficiently large, then each x ∈ X has many neighbors in Y_{X,ε}, and this observation aids the analysis. The rest of the calculations, in effect, remove the estimation bias of Σ̇_{X,Y_{X,ε}} in (4) to arrive at Σ̂_{X,Y_{X,ε}}.
Our second result, proved in Section 5.2, is a finite-sample bound for Σ̂_{X,Y_{X,ε}}.

Theorem 3.2. Suppose that the setup and conditions of Theorem 3.1 hold and, in addition, that N ≥ log(n) and N_{X,ε} ≥ n. Then the error ‖Σ̂_{X,Y_{X,ε}} − Σ̇_X‖_F satisfies the bound (25), except with a probability of O(n^{−3} + N^{−3}). Here, O(·) is the standard big-O notation, the probability is with respect to the selection of Y_{X,ε} conditioned on the fixed set X, and B_{µ,ε} is given explicitly in (44).
Remark 4. (Discussion) Theorem 3.2 states that Σ̂_{X,Y_{X,ε}} can reliably estimate Σ̇_X with high probability. We offer several remarks to help interpret this result.
• The conditions in (21) were discussed in Remark 1. The requirement that N_{X,ε} ≥ n^{1/20} in (21) has been strengthened to N_{X,ε} ≥ n in the statement of Theorem 3.2. However, this will again be automatically satisfied in cases of interest, as we discuss below.
• Let us now dissect the estimation error, namely the right-hand side of (25). As discussed in Remark 1, B_{µ,ε} in effect captures the non-uniformity of the measure µ. In particular, the right-hand side of (25) can be sharpened by setting B_{µ,ε} = 0 in the setting described in that remark.
• Similar to Remark 1, the terms involving ε on the right-hand side of (25) can be made negligible by choosing ε to be suitably small. We omit these terms in the discussion below.
• The fourth term on the right-hand side of (25) can be controlled by making N_{X,ε} suitably large. We discuss this point further below.
• The final term on the right-hand side of (25) is negligible compared to the fourth, and we omit it in our discussion below.
Remark 5. (Estimating Σ_µ) Define ρ_{µ,ε} as the infimum of ρ_{µ,X,ε} over admissible point clouds X, and note that ρ_{µ,X,ε} ≥ ρ_{µ,ε} for any X; see (22). Now combining Theorem 3.2 with Proposition 1, removing the conditioning on X, and omitting the negligible terms yields the bound (26) on ‖Σ̂_{X,Y_{X,ε}} − Σ_µ‖_F, which holds with high probability when both X and Y_{X,ε} are selected randomly, therefore quantifying how well the full procedure in Algorithm 1 estimates the second-moment matrix of f(·). This conclusion holds for any value of ε small enough that (i) ε_X ≥ ε (see (14)) with high probability over the random draw of X, and (ii) the terms involving ε on the right-hand side of (25) are made negligible. As suggested in Remark 1, we can normalize this bound by dividing both sides by L_f², yielding (27). We discuss the terms appearing on the right-hand side of (27):
• As described in Remark 1, in some settings B_{µ,ε}/L_f² will be zero, while in other settings controlling B_{µ,ε}/L_f² will require choosing N_{X,min,ε} ≳ √n log² N.
• The second and third terms in (27) dictate the convergence rate of the error as the number of samples increases. In particular, setting N_{X,ε} proportional to N log²(N) gives N ≈ N_{X,ε}/log²(N) and an overall convergence rate (perhaps to a nonzero bias B_{µ,ε}/L_f²) of √(log⁴(N_{X,ε})/N_{X,ε}) as the number N_{X,ε} of secondary samples (which dominates the total) increases. Up to logarithmic terms, this is the same as the convergence rate appearing in Proposition 1, where perfect knowledge of the gradients was available.
• As a function of the ambient dimension n, the second term in (27) will dominate the third. Controlling the second term in (27) will require ensuring that N_{X,ε} (and thus the overall number of samples) scales like n², neglecting logarithmic factors.

Remark 6. (Proof strategy)
The estimation error here is decomposed into "diagonal" and "off-diagonal" terms. The diagonal term, we find, can be written as a sum of independent random matrices and controlled by applying a standard Bernstein inequality. The off-diagonal term, however, is a second-order chaos (a certain sum of products of random variables) and requires additional care.

Remark 7. (Possible improvements)
In combination with Weyl's inequality [4], (26) might be used to control the distance between the spectrum of Σ̂_{X,Y_{X,ε}} and that of Σ_µ. Likewise, given an integer r ≤ n, standard perturbation results [60] might be deployed to measure the principal angle between the span of the leading r eigenvectors of Σ̂_{X,Y_{X,ε}} and an r-dimensional active subspace of f(·). To obtain the sharpest bounds, both of these improvements would require controlling the spectral norm of Σ̂_{X,Y_{X,ε}} − Σ_µ rather than its Frobenius norm (which is bounded in Theorem 3.2 above). A detailed argument favoring the spectral norm in the context of active subspaces is also given in [38]. Controlling the spectral norm of the error, however, appears to be considerably more difficult. As an aside, let us point out that the spectrum of Σ_µ in relation to f(·) has been studied in [19, 57]. Another interesting question for future work is the extension of these results to the case where f(·) is vector-valued, rather than scalar-valued; see [64] and the references therein.
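As a concrete illustration of the subspace-perturbation question raised above, the principal angles between two subspaces can be read off from the singular values of U^*V. A generic sketch (not tied to any particular perturbation bound):

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles (radians) between the column spans of U and V.

    U and V must have orthonormal columns; the cosines of the principal
    angles are the singular values of U^T V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Identical subspaces give angle 0; orthogonal subspaces give pi/2.
U = np.eye(4)[:, :2]
V = np.eye(4)[:, 2:]
```

Applied to orthonormal bases of the leading eigenspaces of Σ̂_{X,Y_{X,ε}} and Σ_µ, this measures how well the estimated active subspace matches the true one.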
4. Related work. As argued in Section 1, the second-moment matrix (or its leading eigenvectors) is of particular relevance in the context of ridge approximation. Recall that a ridge function takes the form

f(x) = h(A^* x),

where A is an n × r matrix with r < n and h : R^r → R. Such a function varies only along the r-dimensional subspace spanned by the columns of A and is constant along directions in the (n − r)-dimensional orthogonal complement of this subspace. A large body of work exists in the approximation theory literature on learning ridge functions from point samples [46, 15, 9, 54, 24, 34, 26, 43, 7, 35]. Most of these works focus on finding an approximation to the underlying function h and/or the dimensionality-reducing matrix A (or its column span). When f(·) is a ridge function, the r-dimensional column span of A coincides with the span of the eigenvectors of Σ_µ, which will have rank r. This illuminates the connection between ridge approximation and second-moment matrices.
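Since ∇f(x) = A ∇h(A^* x) by the chain rule, every gradient of a ridge function lies in the column span of A, so Σ_µ has rank at most r. A small illustration (the particular h is chosen arbitrarily for the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 2
A = rng.standard_normal((n, r))

# Ridge function f(x) = h(A^T x) with an arbitrary smooth h(z) = sin(z_1) + z_2^2.
# By the chain rule, grad f(x) = A grad h(A^T x) lies in the column span of A.
def grad_f(x):
    z = A.T @ x
    return A @ np.array([np.cos(z[0]), 2.0 * z[1]])

X = rng.uniform(-1.0, 1.0, size=(n, 5000))
G = np.stack([grad_f(X[:, i]) for i in range(X.shape[1])], axis=1)
Sigma = G @ G.T / X.shape[1]

# All gradients live in an r-dimensional subspace, so Sigma has rank at most r.
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
```

The trailing n − r eigenvalues vanish (up to floating-point error), and the leading r eigenvectors span the column space of A, which is the rank-r structure exploited in the ridge approximation literature.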
In [19], the authors develop an algorithm to learn the column span of A when its basis vectors are (nearly) sparse. The sparsity assumption was later removed in [57, 5] and replaced with the assumption that this column span is low-dimensional (r is small). For learning such a low-dimensional subspace, these models allow for algorithms with better sample complexities compared to Theorem 3.2, which, in contrast, provides a guarantee on learning the entire second-moment matrix Σ_µ and holds without any assumption (such as low rank) on Σ_µ. In this sense, the present work fills a gap in the literature of ridge approximation; see also Section 1.2. For completeness, we note that it is natural to ask whether the results in [57] could simply be applied in the "general case" where the subspace dimension r approaches the ambient dimension n (thus relaxing the critical structural assumption in that work). As detailed in Section 5 of [57], however, the sampling complexity in this general case will scale with n⁵ (ignoring log factors). In contrast, our bound (26) requires only that the total number of function samples N + N_{X,ε} scale with n².
A ridge-like function is one for which f(x) ≈ h(A^* x). The framework of active subspaces provides a mechanism for detecting ridge-like structure in functions and reducing their dimensionality [10, 11, 12]. For example, in scientific computing, f(x) may represent the scalar-valued output of some complicated simulation that depends on a high-dimensional input parameter x. By finding a suitable n × r matrix A, one can reduce the complexity of parameter studies by varying inputs only in the r-dimensional column space of A. The term active subspace refers to the construction of A via the r leading eigenvectors of Σ_µ.
In high-dimensional statistics and machine learning, similar structures arise in the task of regression, where, given a collection of data pairs (x_i, z_i), the objective is to construct a function z = f(x) that models the relationship between x and z. One line of work in this area is projection pursuit where, spurred by the interest in generalized additive models [30], the aim is to construct f(·) using functions of the form Σ_i h_i(a_i^* x) [22, 33, 16]. Further connections with neural networks are studied in [45], [20, Chapter 11]. See also [59, 56] for connections with Gaussian process regression and uncertainty quantification. Sufficient dimension reduction and related topics [39, 63, 13, 62, 23, 52, 32, 25] are still other lines of related work in statistics. In this context, a collection of data pairs (x_i, z_i) is observed, having been drawn independently from some unknown joint density. The assumption is that z is conditionally independent of x, given A^* x for some n × r matrix A. The objective is then to estimate the column span of A, known in this literature as the effective subspace for regression.
Finding the second-moment matrix of a function is also closely related to covariance estimation (see (1)), which is widely studied in modern statistics, often under various structural assumptions on the covariance matrix, e.g., sparsity of its inverse [6,8,14,36,49]. In this context, [3,37,2,47] appear to be the most relevant to the present work, in part because they too impose no structural assumptions. For the sake of brevity, we focus on [3], which offers an unbiased estimator for the covariance matrix of a random vector x given few measurements of multiple realizations of x, in the form of {Φ_i x_i}_i for low-dimensional (and uniformly random) orthogonal projection matrices {Φ_i}_i. It is important to point out that, by design, the estimator in [3] is not applicable to our setup. Our framework might instead be interpreted as a sum of rank-1 projections. To further complicate matters, the probability measure µ on D is not necessarily uniform, so we cannot hope to explicitly determine the distribution of the crucial components of the estimator. Instead, we rely on standard tools from empirical process theory to control the error bounds. A few other works [42,41,27] also involve covariance estimation from partially observed random vectors.
Yet another related field is matrix completion and recovery [50,51,18] and subspace estimation from data with erasures [17], where typically a low-rank structure is imposed. Lastly, in numerical linear algebra, random projections are increasingly used to facilitate matrix operations [53,29,40]; as a result, a very similar mathematical toolbox appears in that line of research.
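To give a flavor of that toolbox, here is a minimal sketch (our own example, not from the paper) of the randomized range finder of Halko, Martinsson, and Tropp, a standard use of random projections in numerical linear algebra.

```python
import numpy as np

# Minimal sketch of a randomized range finder: multiply A by a random
# test matrix, orthonormalize, and use the result as a basis for range(A).
rng = np.random.default_rng(1)
m, n, rank, k = 200, 100, 5, 10

# A low-rank test matrix.
A = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

Omega = rng.standard_normal((n, k))   # random test matrix
Q = np.linalg.qr(A @ Omega)[0]        # orthonormal basis for range(A @ Omega)

# Q captures the range of A: the projection residual is tiny for a
# rank-5 matrix sketched with k = 10 random probes.
residual = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
```

The oversampling k > rank makes the captured subspace contain the range of A with overwhelming probability, which is the same high-probability flavor as the guarantees in this paper.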

5. Theory. This section contains the proofs of the two main results of this paper.

5.1. Proof of Theorem 3.1. Let us begin by outlining the proof strategy.
• First, we introduce a new quantity $\dddot{\Sigma}_{X,Y_{X,\epsilon}} \in \mathbb{R}^{n\times n}$. Conditioned on a certain "good" event $E_1$, $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ is easier to work with than $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$.
• Then, for fixed $X \subset D_\epsilon$, we define another "good" event $E_2$ on which each $x \in X$ has sufficiently many neighbors in $Y_{X,\epsilon}$. Lemma 5.2 below shows that $E_2$ is very likely to happen if $N_{X,\epsilon} = \#Y_{X,\epsilon}$ is large enough. Conditioned on the event $E_2$, Lemma 5.3 below shows that $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ is a nearly unbiased estimator of $\widehat{\Sigma}_X$.
• Lastly, we remove the conditioning on $E_1 \cap E_2$ to complete the proof of Theorem 3.1.
We now turn to the details and introduce $\dddot{\Sigma}_{X,Y_{X,\epsilon}} \in \mathbb{R}^{n\times n}$ in (29). Here, $\dddot{\nabla}_{Y_{x,\epsilon}} f(\cdot)$ is defined in (30), and $P_{x,y} \in \mathbb{R}^{n\times n}$ is the orthogonal projection onto the direction of $y - x$. In order to relate $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ to $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$, we invoke the following result, proved in Appendix E.
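To make the projection $P_{x,y}$ concrete, here is a small hypothetical sketch (our own code; the test function is an arbitrary choice): $P_{x,y}$ is the rank-one orthogonal projection onto the direction of $y - x$, and a first-order Taylor expansion shows that the finite difference $(f(y)-f(x))(y-x)/\|y-x\|_2^2$ approximates $P_{x,y}\nabla f(x)$.

```python
import numpy as np

# Hypothetical sketch of the basic building block: P_{x,y} is the rank-1
# orthogonal projection onto span(y - x), and a first-order Taylor
# expansion gives P_{x,y} grad_f(x) ≈ (f(y) - f(x)) (y - x) / ||y - x||^2.
rng = np.random.default_rng(2)
n, eps = 8, 1e-4

f = lambda x: np.sin(x[0]) + x[1] ** 2 + x @ x / 10.0
def grad_f(x):
    g = x / 5.0
    g[0] += np.cos(x[0])
    g[1] += 2.0 * x[1]
    return g

x = rng.standard_normal(n)
y = x + eps * rng.standard_normal(n)   # a nearby neighbor of x

d = y - x
P = np.outer(d, d) / (d @ d)           # rank-1 projection onto span(y - x)

lhs = P @ grad_f(x)                    # exact projected gradient
rhs = (f(y) - f(x)) * d / (d @ d)      # finite-difference surrogate

err = np.linalg.norm(lhs - rhs)        # O(eps) Taylor remainder
```

The surrogate on the right-hand side uses only two function values, which is what allows the second-moment matrix to be learned from point evaluations alone.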
Moreover, consider the event $E_1$ for a threshold $Q_{X,\epsilon} > 0$ to be set later. Then, conditioned on the event $E_1$, the two estimators are close in the sense of the bound (34). Thanks to Assumption 1, the event $E_1$ is very likely to happen for the right choice of $Q_{X,\epsilon}$. Indeed, if we set $Q_{X,\epsilon} = \gamma_2 \log N_{X,\epsilon}$ for $\gamma_2 \ge 1$, then a high-probability bound on $E_1$ follows from (19) and an application of the union bound (similar to the slightly more general result in Lemma I.3). Roughly speaking, in light of Lemma 5.1, $\widehat{\Sigma}_{X,Y_{X,\epsilon}} \approx \dddot{\Sigma}_{X,Y_{X,\epsilon}}$. It therefore suffices to study the bias of $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ in the sequel. As suggested earlier, if $\#Y_{X,\epsilon} = N_{X,\epsilon}$ is sufficiently large, then every $x \in X$ will likely have many neighbors in $Y_{X,\epsilon}$, namely $\#Y_{x,\epsilon} = N_{x,\epsilon} \gg 1$ for every $x \in X$. This claim is formalized below and proved in Appendix F.
Then, except with a probability of at most $N^{1-\gamma_3}$, the conclusion of Lemma 5.2 holds. To use Lemma 5.2 here, we proceed as follows. For $\gamma_3 \ge 1$, suppose that $N_{X,\epsilon}$ is large enough, and consider the event $E_2$ in (38) on which, in particular, each $x \in X$ has at least $N_{X,\min,\epsilon}$ neighbors in $Y_{X,\epsilon}$. In light of Lemma 5.2, $E_2$ is very likely to happen, provided that (40) holds, where we conveniently defined $N_{X,\min,\epsilon}$. Conditioned on the event $E_2$, $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ in (29) takes the simplified form (42). Using this simplified form, we will prove the following result in Appendix G.
Roughly speaking, it states that, conditioned on the event $E_2$, $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ is a nearly unbiased estimator of $\widehat{\Sigma}_X$.

Lemma 5.3. Fix $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$. Then the bias bound stated in the lemma holds, with the quantity $B_{\mu,\epsilon}$ defined therein. Moreover, suppose that $\mu$ is the uniform probability measure on $D$, and that $N_{x,\epsilon} = N_{x',\epsilon}$ for every pair $x, x' \in X$. Then, conditioned on $E_2$, one can replace $B_{\mu,\epsilon}$ with $0$, and thus $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ is an unbiased estimator of $\widehat{\Sigma}_X$.

Next, we remove the conditioning on the event $E_2$ with the aid of bounds that follow from (12), (9), (30), and (29). Combining these with (45) and (39) yields a bound which, to reiterate, holds when $N_{X,\min,\epsilon} \gtrsim \gamma_3^2 \log^2 N$ and under (40). Lastly, we reintroduce $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$ by invoking Lemma 5.1, together with (46) and (34).

5.2. Proof of Theorem 3.2. At a high level, the proof strategy here matches that of Theorem 3.1. First, we replace $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$ with the simpler quantity $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ defined in (29). More specifically, in light of Lemma 5.1, it suffices to study $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ in the sequel.
Next, for $N_{X,\min,\epsilon} > 0$ to be set later, recall the "good" event $E_2$ in (38), on which every $x \in X$ has at least $N_{X,\min,\epsilon}$ neighbors in $Y_{X,\epsilon}$. Conditioned on the event $E_2$, $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ takes the simpler form (42), using which we prove the following result in Appendix H.
Lemma 5.4. Fix $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$. If $\log(n) \ge 1$, $N \ge \log(n)$, and $\log(N_{X,\epsilon}) \ge \log(n)$, then, conditioned on $E_2$ and $X$, the bound (49) holds for $\gamma_7 \ge 1$ and $\gamma_2 \ge 3$, except with a probability of at most $e^{-\gamma_7} + n^{2-\log\gamma_7}$ plus a polynomially small term in $N_{X,\epsilon}$.

We next remove the conditioning on the event $E_2$ by letting $R$ denote the right-hand side of (49) and writing the corresponding total-probability bound under (37). Lastly, we reintroduce $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$ by invoking Lemma 5.1: the stated bound holds with a failure probability of the order above, assuming (37) holds and that $\log(n) \ge 1$, $N \ge \log(n)$, and $\log(N_{X,\epsilon}) \ge \log(n)$. Setting $\gamma_2 = 4$, $\gamma_3 = 4$, and $\gamma_7 = 149\log(N)$, and noting that $N_{X,\epsilon} \ge N$, completes the proof of Theorem 3.2.
6. Acknowledgments.

Appendix A. Toolbox. In this section, we list a few results that are used repeatedly in the rest of the appendices. Recall the inequalities (54) for a random variable $z$ and an event $A$ (with complement $A^C$). We also recall the Bernstein inequality [28].

Proposition 2 (Bernstein inequality). Let $\{A_i\}_i$ be a finite sequence of zero-mean independent random matrices, and set the scale parameters $b$ and $\sigma$ accordingly.

(Footnote 6: To see why the first inequality in (54) holds, note the decomposition in terms of the indicator function $1_A(\cdot)$ of the event $A$; the identity is easily verified, and (54) follows immediately.)

Then, for $\gamma \ge 1$ and except with a probability of at most $e^{-\gamma}$, the stated concentration bound holds.

Appendix B. Proof of Proposition 1. Recalling the definition of $\widehat{\Sigma}_X$ from (12), we compute its expectation, which proves the first claim. To control the deviation about the mean, we invoke the standard Bernstein inequality, recorded in Proposition 2 for the reader's convenience. Note that $\widehat{\Sigma}_X$ minus its mean can be written as a sum of matrices $\{A_x\}_x \subset \mathbb{R}^{n\times n}$ that are independent and zero-mean. To apply the Bernstein inequality (Proposition 2), we compute the parameters $b$ and $\sigma$ using (9), (59), and $\#X = N$. Therefore, for $\gamma_4 \ge 1$ and except with a probability of at most $e^{-\gamma_4}$, Proposition 2 yields the desired deviation bound, which completes the proof of Proposition 1 when we take $\gamma_4 = \log n$.
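As an illustration of why the Bernstein inequality is the right tool here, the following hypothetical experiment (our own, not from the paper) checks numerically that a sum of $N$ independent, zero-mean, bounded random matrices has spectral norm far below the trivial bound $N b$ and near the $\sqrt{N}$ scale that the inequality predicts.

```python
import numpy as np

# Hypothetical numerical check of matrix-Bernstein-type concentration:
# A_i = z_i z_i^T - I/n with z_i uniform on the unit sphere are independent,
# zero-mean, and bounded in spectral norm by b <= 1, so ||sum_i A_i||
# should be far below the trivial bound N*b and on the sqrt(N) scale.
rng = np.random.default_rng(3)
n, N = 10, 4000

Z = rng.standard_normal((N, n))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # rows uniform on the sphere

S = Z.T @ Z - N * np.eye(n) / n                  # sum_i (z_i z_i^T - I/n)
dev = np.linalg.norm(S, 2)                       # spectral norm of the sum
```

Here E[z z^T] = I/n, so each summand is exactly centered; the observed deviation is orders of magnitude below N = 4000.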
Appendix C. Uniform measure satisfies Assumption 1. We verify in this appendix that the uniform probability measure on $D$ satisfies Assumption 1. Fix arbitrary $\epsilon > 0$ and $x$ in the $\epsilon$-interior of $D \subseteq \mathbb{R}^n$, namely $x \in D_\epsilon$, assuming that $D_\epsilon \ne \emptyset$. The conditional measure on the neighborhood $B_{x,\epsilon}$ is also uniform, so that $y|x \sim \mathrm{uniform}(B_{x,\epsilon})$. Then, for fixed $v \in \mathbb{R}^n$ with $\|v\|_2 = 1$, we study the tail of the random variable $\|P_{x,y} v\|_2^2$ as follows. We note that (19) trivially holds for any $\gamma_1 > n$, since $\|P_{x,y} v\|_2^2 \le \|v\|_2^2$ because $P_{x,y}$ is an orthogonal projection. Thus, it suffices to consider fixed $\gamma_1 \in (0, n]$. Recalling the moments of the beta distribution, we obtain a bound involving the beta function $B(\cdot,\cdot)$, where $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\,dt$ is the usual gamma function. In order to choose $\lambda$, we rewrite (62) as a tail bound on $\|P_{x,y}v\|_2^2$ and minimize the exponent $l(\cdot)$ by computing its derivative, which involves the digamma function $\psi(a) = \Gamma'(a)/\Gamma(a)$. It is well known that $\psi(a) \approx \log(a)$ for large $a$ (see, for example, [44]). To guide our choice of $\lambda$, note that if $n$ is sufficiently large and we take $\lambda$ with $1 \ll \lambda \ll n$, the derivative nearly vanishes, suggesting the choice $\lambda = \gamma_1/2$. With this choice, we find that Assumption 1 holds for the uniform probability measure with $\epsilon_\mu = \infty$ and $K_\mu = 1/2$.
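A quick numerical sanity check of the beta-distribution fact used above (hypothetical code, our own): for $y$ uniform in a ball around $x$, the direction of $y - x$ is uniform on the sphere, so $\|P_{x,y}v\|_2^2$ follows a $\mathrm{Beta}(1/2, (n-1)/2)$ law with mean $1/n$, matching $\mathbb{E}[P_{x,y}] = I_n/n$.

```python
import numpy as np

# Hypothetical check: with y uniform in the ball B_{x,eps}, the direction
# (y - x)/||y - x|| is uniform on the sphere, so ||P_{x,y} v||_2^2 is
# Beta(1/2, (n-1)/2)-distributed with mean 1/n.
rng = np.random.default_rng(4)
n, eps, N = 10, 0.1, 200_000

x = np.zeros(n)
v = np.eye(n)[0]                       # the unit vector e_1

# Draw y uniform in the ball: uniform direction, radius ~ eps * U^(1/n).
U = rng.standard_normal((N, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
R = eps * rng.uniform(size=(N, 1)) ** (1.0 / n)
Y = x + R * U

D = Y - x
proj_sq = (D @ v) ** 2 / np.sum(D * D, axis=1)   # ||P_{x,y} v||^2

mean_est = proj_sq.mean()                         # should be close to 1/n
```

The empirical mean lands within sampling error of $1/n = 0.1$, which is the isotropy fact that drives the unbiasedness claims for the uniform measure.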
Appendix D. Estimating ∇f (x). For fixed x ∈ D, by drawing samples from the neighborhood of x and then applying the method of finite differences, we may estimate ∇f (x). This is described below for the sake of completeness.
Proposition 3. Fix $x \in D$ and take $\epsilon > 0$ small enough so that $x$ belongs to the $\epsilon$-interior of $D$, namely $x \in D_\epsilon$. Draw $y$ from the conditional measure on the neighborhood $B_{x,\epsilon}$, namely $y|x \sim \mu_{x,\epsilon}$ (see (18)). For an integer $N_{x,\epsilon}$, let $Y_{x,\epsilon} \subset B_{x,\epsilon}$ contain $N_{x,\epsilon}$ independent copies of $y$. Then the stated bias bound holds. In particular, if $\mu$ is the uniform probability measure on $D$, then $B_{\mu,\epsilon} = 0$.
Proof. First, we replace $\widehat{\nabla}_{Y_{x,\epsilon}} f(x)$ with the simpler quantity $\dddot{\nabla}_{Y_{x,\epsilon}} f(x)$, defined via the projections $P_{x,y} \in \mathbb{R}^{n\times n}$ onto the direction of $y - x$. By definition, the two quantities are related through Taylor's expansion and (10). Loosely speaking, then, $\widehat{\nabla}_{Y_{x,\epsilon}} f(x) \approx \dddot{\nabla}_{Y_{x,\epsilon}} f(x)$, and it therefore suffices to study the estimation bias of $\dddot{\nabla}_{Y_{x,\epsilon}} f(x)$. To that end, we simply compute its conditional expectation, which, by (72) and (73), yields the claimed bound. In particular, when $\mu$ is the uniform probability measure on $D$, $P_{x,y}$ is an isotropic random matrix (for fixed $x \in D$). Therefore, $\mathbb{E}_{y|x}[P_{x,y}] = C \cdot I_n$ for some scalar $C$.
To find $C$, note that $\mathrm{trace}(\mathbb{E}_{y|x}[P_{x,y}]) = \mathbb{E}_{y|x}[\mathrm{trace}(P_{x,y})] = 1$, because $P_{x,y}$ is a rank-1 orthogonal projection; hence $C = 1/n$. Consequently, when $\mu$ is the uniform measure, $B_{\mu,\epsilon} = 0$. This completes the proof of Proposition 3.
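A minimal numerical sketch of this gradient estimator (hypothetical code, our own, assuming a uniform measure on the ball): since $\mathbb{E}[P_{x,y}] = I_n/n$, averaging the finite-difference surrogates and multiplying by $n$ gives a nearly unbiased estimate of $\nabla f(x)$.

```python
import numpy as np

# Hypothetical sketch of the finite-difference gradient estimator: with
# y uniform in B_{x,eps} and E[P_{x,y}] = I/n,
#   (n/N) * sum_y (f(y) - f(x)) (y - x) / ||y - x||^2  ≈  grad f(x).
rng = np.random.default_rng(5)
n, eps, N = 6, 1e-3, 50_000

a = np.arange(1.0, n + 1.0)            # f(x) = a.x + 0.5 ||x||^2
grad_true = lambda x: a + x

x = rng.standard_normal(n)

# y uniform in the ball around x.
U = rng.standard_normal((N, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
R = eps * rng.uniform(size=(N, 1)) ** (1.0 / n)
Y = x + R * U

D = Y - x
fvals = Y @ a + 0.5 * np.sum(Y * Y, axis=1)
fx = x @ a + 0.5 * (x @ x)
fd = (fvals - fx)[:, None] * D / np.sum(D * D, axis=1, keepdims=True)
g_hat = n * fd.mean(axis=0)            # factor n undoes E[P] = I/n

err = np.linalg.norm(g_hat - grad_true(x))
```

For this quadratic test function the curvature term averages out by symmetry, so the residual error is pure sampling noise of order $1/\sqrt{N}$.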
Appendix E. Proof of Lemma 5.1. We only verify the second claim, as the proof of the first is similar. Conditioned on the event $E_1$, we apply the inequality $\|aa^* - bb^*\| \le \|a - b\|_2 \,(\|a\|_2 + \|b\|_2)$, valid for any $a, b \in \mathbb{R}^n$, together with (72) and (75), to bound the difference of the per-point terms. This, in turn, yields (77), whose derivation uses the fact that $|\mathrm{trace}(A)| \le \sqrt{n}\,\|A\|_F$ for any $A \in \mathbb{R}^{n\times n}$. Recalling the definitions of $\widehat{\Sigma}_{X,Y_{X,\epsilon}}$ and $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ in (7) and (29), respectively, and combining (76) and (77), the claim follows. This completes the proof of Lemma 5.1.
Appendix F. Proof of Lemma 5.2. Our objective is to establish that, given $X$ and the neighborhood radius $\epsilon$, each $x \in X$ has many neighbors in $Y_{X,\epsilon}$, provided that $N_{X,\epsilon} = \#Y_{X,\epsilon}$ is sufficiently large. To that end, we proceed as follows. Recall that $\mu_{X,\epsilon}$ is the conditional distribution on the $\epsilon$-neighborhood of the point cloud $X$ (see (16)). With $y \sim \mu_{X,\epsilon}$ and for fixed $x \in X$, $y$ belongs to the $\epsilon$-neighborhood of $x$ (namely, $y \in B_{x,\epsilon}$) with a certain probability; equivalently, the indicator $1_{y \in B_{x,\epsilon}}$ follows a Bernoulli distribution. Then, to investigate the concentration of $N_{x,\epsilon}$ about its expectation, we write $N_{x,\epsilon}$ minus its mean as a sum of independent zero-mean random variables $\{a_y\}_y$ (for fixed $x \in X$). In order to apply Bernstein's inequality (Proposition 2) to this sum, we compute the required parameters. From Proposition 2, it then follows that the deviation bound holds for $\gamma_5 \ge 1$, except with a probability of at most $e^{-\gamma_5}$. Recall that $\#X = N$. Then, an application of the union bound with the choice $\gamma_5 = \gamma_3 \log N$ (with $\gamma_3 \ge 1$) yields the claim, except with a probability of at most $N e^{-\gamma_3 \log N} = N^{1-\gamma_3}$. For the bound above to hold, we assume that $N_{X,\epsilon}$ is sufficiently large (so that the requirement in (85) holds for every $x \in X$). In fact, if (88) holds, then (87) readily yields the claim, again except with a probability of at most $N^{1-\gamma_3}$. This completes the proof of Lemma 5.2.
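A small simulation (hypothetical, our own) illustrating the neighbor-counting argument: the number of samples landing in $B_{x,\epsilon}$ is a binomial count that concentrates sharply around its mean once the total number of samples is large.

```python
import numpy as np

# Hypothetical illustration of the neighbor-count concentration: each of
# N_total samples lands in B_{x,eps} independently with probability p, so
# the count N_x is Binomial(N_total, p) and deviates from p * N_total by
# roughly sqrt(p * N_total), in line with the Bernstein bound.
rng = np.random.default_rng(6)
N_total, p, trials = 100_000, 0.01, 500

counts = rng.binomial(N_total, p, size=trials)
mean, sd = counts.mean(), counts.std()

# Expected count p * N_total = 1000; typical fluctuation ~ sqrt(1000) ≈ 32.
```

Relative to the mean of 1000, the fluctuation of about 3% is exactly the regime in which every $x \in X$ retains at least $N_{X,\min,\epsilon}$ neighbors with high probability.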
Appendix G. Proof of Lemma 5.3. Throughout, $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$ are fixed, and we further assume that the event $E_2$ holds (see (38)). For now, suppose in addition that the neighborhood structure $\mathcal{N}_{X,\epsilon} := \{N_{x,\epsilon}\}_{x\in X}$ is fixed too. Recalling the definition of $\dddot{\nabla}_{Y_{x,\epsilon}} f(\cdot)$ from (30), we first introduce the shorthand (90), and then separate the "diagonal" ($y = y'$) and "off-diagonal" ($y \ne y'$) components of the expectation of $\ddddot{\Sigma}_{X,Y_{X,\epsilon}}$ as in (91); the last step there uses the fact that distinct elements of $Y_{x,\epsilon}$ are statistically independent. We next replace both the diagonal and off-diagonal components (namely, the first and second sums in the last line of (91)) with simpler expressions, beginning with the diagonal term, which we approximate by another sum as in (92).
To replace the off-diagonal term in the last line of (91), first recall inequality (93), and then combine it with (9) and (70) to obtain (94). We may now replace the diagonal and off-diagonal components in the last line of (91) with simpler expressions while incurring a typically small error; more specifically, in light of (92) and (94), (91) implies (95). We can further simplify the first line of (95) by replacing $N_{x,\epsilon}$ with $N_{X,\min,\epsilon}$; invoking (12), this yields (96). Next, we replace $\mathrm{trace}[\widehat{\Sigma}_X]$ in the first line of (96) with $\mathrm{trace}[\ddddot{\Sigma}_{X,Y_{X,\epsilon}}]$. To that end, we first record the consequence (97) of (96), whose derivation uses $\mathrm{trace}[I_n] = n$ and the inequality $|\mathrm{trace}[A]| \le \sqrt{n}\,\|A\|_F$ for an arbitrary matrix $A \in \mathbb{R}^{n\times n}$. After rearranging, (97) immediately implies (98), which enables us to remove $\mathrm{trace}[\widehat{\Sigma}_X]$ from the first line of (96) and arrive at (99). Lastly, (99) can be rewritten as (100) by introducing $\dddot{\Sigma}_{X,Y_{X,\epsilon}} \in \mathbb{R}^{n\times n}$ as in (101), where the factor in front of $B_{\mu,\epsilon}$ does not exceed one; the third identity there uses (90), and the last step follows from (38). Because $B_{\mu,\epsilon}$ does not depend on $\mathcal{N}_{X,\epsilon}$, it is easy to remove the conditioning on $\mathcal{N}_{X,\epsilon}$ in (100). Consider also the following special case. Let $\mu$ be the uniform probability measure on $D$ and fix $x$ in the $\epsilon$-interior of $D$, namely $x \in D_\epsilon$. Also draw $y$ from $\mu_{x,\epsilon}$, namely $y|x \sim \mu_{x,\epsilon}$ (see (18)). Then, as stated in Proposition 3, $B_{\mu,\epsilon} = 0$.
Furthermore, it is known [3] that $y$ admits the decomposition (103), where $\omega$ follows a beta distribution, $\alpha$ is uniformly distributed on the unit sphere in $\mathbb{R}^{n-1}$, and the two variables are independent. Finally, $A \in \mathbb{R}^{n\times(n-1)}$ in (103) is an orthonormal basis for the directions orthogonal to $\nabla f(x) \in \mathbb{R}^n$. Using the expressions for the first and second moments of the beta distribution, we conclude that $B_{\mu,\epsilon} = 0$. Furthermore, assume that $N_{x,\epsilon} = N_{x',\epsilon}$ for every pair $x, x' \in X$. Then the upper bound in (96) improves to $B_{\mu,\epsilon} = 0$; namely, $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ is an unbiased estimator of $\widehat{\Sigma}_X$, conditioned on the event $E_2$. This completes the proof of Lemma 5.3.

Appendix H. Proof of Lemma 5.4. Throughout, $X$ is fixed and we assume that the event $E_2$ holds (see (38)). We also take $\mathcal{N}_{X,\epsilon} = \{N_{x,\epsilon}\}_{x\in X}$ (see (17)) to be any fixed neighborhood structure consistent with $E_2$.
To bound the estimation error, we start from (106). It therefore suffices to study the concentration of $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ about its expectation. In fact, as we show next, it is more convenient to first study the concentration of $\ddddot{\Sigma}_{X,Y_{X,\epsilon}}$.
Indeed, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$, the expression for $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ in (29) simplifies to (108). Consequently, the deviation of $\dddot{\Sigma}_{X,Y_{X,\epsilon}}$ about its expectation can be bounded as in (109); the first inequality there uses (108), along with the linearity of the trace and the inequality $|\mathrm{trace}[A]| \le \sqrt{n}\,\|A\|_F$ for arbitrary $A \in \mathbb{R}^{n\times n}$. Thanks to (109), it suffices to study the concentration of $\ddddot{\Sigma}_{X,Y_{X,\epsilon}}$ about its expectation. The following result is proved in Appendix I.
Lemma H.1. Fix $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$. If $\log(n) \ge 1$, $N \ge \log(n)$, and $\log(N_{X,\epsilon}) \ge \log(n)$, then, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$, the stated bound holds for $\gamma_7 \ge 1$ and $\gamma_2 \ge 3$, except with a probability of at most (111).

Combining (106), (109), and Lemma H.1 tells us that if $\log(n) \ge 1$, $N \ge \log(n)$, and $\log(N_{X,\epsilon}) \ge \log(n)$, then, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, $X$, and $\epsilon$, the claimed bound holds, except with the probability appearing in (111). We observe that this bound and probability do not depend on $\mathcal{N}_{X,\epsilon}$, and so the same statement holds with the same probability when we condition only on $E_2$ and $X$. This completes the proof of Lemma 5.4.
Appendix I. Proof of Lemma H.1. Throughout, $X$ is fixed and the event $E_2$ holds. We will also use $\mathcal{N}_{X,\epsilon} = \{N_{x,\epsilon}\}_{x\in X}$ to summarize the neighborhood structure of the data (see (17)). As in Appendix G, we again decompose $\ddddot{\Sigma}_{X,Y_{X,\epsilon}}$ into "diagonal" ($y = y'$) and "off-diagonal" ($y \ne y'$) components as in (112). This decomposition, in turn, allows us to break the error down into the contributions of the diagonal and off-diagonal components as in (113). We bound the two norms on the right-hand side of (113) separately in Appendices J and K, respectively, and report the results below.
Lemma I.1. Fix $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$. Consider the event $E_1$ in (114) for $Q_{X,\epsilon} > K_\mu^{-1}$ to be set later. Then, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$, the stated bound holds for $\gamma_6 \ge 1$, except with a probability of at most $e^{-\gamma_6} + \Pr_{Y_{X,\epsilon}|E_2,\mathcal{N}_{X,\epsilon},X}[E_1^C]$.
Lemma I.2. Fix $X$ and $\epsilon \in (0, \epsilon_{\mu,X}]$. Let $\bar{Y}_{X,\epsilon}$ contain $Y_{X,\epsilon}$ and three independent copies of it; that is, each $\bar{Y}_{x,\epsilon}$ contains $Y_{x,\epsilon}$ and three independent copies of it. Consider the event $E_1$ defined in (114) for $Q_{X,\epsilon} > K_\mu^{-1}$ to be set later. Consider also the event $E_3$, in whose definition $e_i \in \mathbb{R}^n$ denotes the $i$th canonical vector. Assume that $\Pr_{\bar{Y}_{X,\epsilon}|\mathcal{N}_{X,\epsilon},E_1,E_2,X}[E_3^C]$ is sufficiently small, and that $1 \le \log n \le N$. Then, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$, the stated bound holds for $\gamma_7 \ge 1$, except with the stated probability. Before we can apply Lemmas I.1 and I.2 to the right-hand side of (113), however, we must show that the events $E_1$ and $E_3$ are very likely to happen. Owing to Assumption 1, this is indeed the case for the right choice of $Q_{X,\epsilon}$, as shown in Appendix L and summarized below.
Revisiting (113), we put all the pieces together to conclude that if $\log(n) \ge 1$, $N \ge \log(n)$, and $\log(N_{X,\epsilon}) \ge \log(n)$, then, conditioned on $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$ (and setting $\gamma_6 = \gamma_7$; see Lemmas I.1, I.2, and I.3), the combined bound holds except with a probability of at most $e^{-\gamma_6} + e^{-\gamma_7} + n^2 \cdot n^{-\log\gamma_7}$, plus twice the failure probability of the conditioning event.

Appendix J. Proof of Lemma I.1. Throughout, $X$ and the neighborhood structure $\mathcal{N}_{X,\epsilon} = \{N_{x,\epsilon}\}_{x\in X}$ (see (17)) are fixed. Moreover, we assume that the event $E_2$ holds (see (38)). In addition, for $Q_{X,\epsilon} \ge K_\mu^{-1}$ to be set later, we condition on the event $E_1$ in (122). The quantity of interest is then a sum of zero-mean independent random matrices $\{A_{x,y}\}_{x,y} \subset \mathbb{R}^{n\times n}$. To bound this sum, we appeal to Proposition 2 by computing the $b$ and $\sigma$ parameters below.
For arbitrary $x \in X$ and $y \in Y_{x,\epsilon}$, we bound $\|A_{x,y}\|$ using [58, Lemma 5.5] and (9), which yields the parameter $b$. On the other hand, the variance parameter $\sigma^2$ is bounded by combining (123) with [58, Lemma 5.5] and the fact that $\mathbb{E}\|Z - \mathbb{E}Z\|^2 \le \mathbb{E}\|Z\|^2$ for a random matrix $Z$. Thus, in light of Proposition 2, and conditioned on $E_1$, $E_2$, $\mathcal{N}_{X,\epsilon}$, and $X$, the deviation bound holds for $\gamma_6 \ge 1$, except with a probability of at most $e^{-\gamma_6}$. Before we can remove the conditioning on the event $E_1$, we use the law of total expectation to relate the conditional and unconditional expectations. Since (129) holds for any $X$ and $Y_{X,\epsilon}$, we conclude, by the triangle inequality and (129), the corresponding unconditional bound. Lastly, we remove the conditioning on the event $E_1$. The proof of Lemma I.1 is now complete.
Appendix K. Proof of Lemma I.2. Throughout, $X$ and the neighborhood structure $\mathcal{N}_{X,\epsilon} = \{N_{x,\epsilon}\}_{x\in X}$ (see (17)) are fixed. Moreover, we assume that the event $E_2$ holds (see (38)). In addition, for $Q_{X,\epsilon} \ge K_\mu^{-1}$ to be set later, we condition on $E_1$ as defined in (122); this conditioning will be removed at the end. We then decompose the quantity of interest into three norms as in (132); there, $y_s|E_1, x_s$ is distributed according to the restriction of $\mu_{x_s,\epsilon}$ to the event $E_1$. In the following subsections, we separately bound each of the three norms in the last line of (132).
K.1. First norm. In this section, we bound the first norm in the last line of (132) by writing it as a chaos random variable. We first expand this norm in terms of the matrices $\{A_{skl}\}_{s,k,l} \subset \mathbb{R}^{n\times n}$, defined in (134) for every $s \in [1:N]$ and $k, l \in [1:N_{x_s,\epsilon}]$ through the centered projections $P_{x_s,y_{sk}} - \mathbb{E}_{y_s|E_1,x_s}[P_{x_s,y_s}]$ and the outer product $\nabla f(x_s)\nabla f(x_s)^*$. By their definition, the random matrices $\{A_{skl}\}_{s,k,l}$ enjoy the properties in (135), where we used the fact that $N_{X,\epsilon} = \sum_{s=1}^N N_{x_s,\epsilon}$ to calculate the dimensions of $A_{ij}$. In particular, (135) implies (137), where, departing from the standard convention, we index the entries of $A_{ij}$ so that $sk$ refers to the $k$th row of the $s$th block (and hence does not stand for the product of $s$ and $k$). With this new notation, we revisit (133) and express the first norm through the scalar chaos variables $a_{ij}$ in (140).
• First, we bound the moments of $a_{ij}$.
• Second, we use Markov's inequality to find a tail bound for $a_{ij}$ (given its moments).
Each step is discussed in a separate subsection below.
K.1.2. Moments of $a_{ij}$. In order to control the moments of $a_{ij}$, we take the following steps:
• symmetrization,
• decoupling,
• modulation with Rademacher sequences, and finally
• bounding the moments of the resulting decoupled chaos random variable.
Each of these steps is detailed in a separate paragraph below.

Symmetrization. To control the moments of $a_{ij}$, we first use a symmetrization argument. With $s \in [1:N]$ and conditioned on $x_s$ and $E_1$, let $Y'_{x_s,\epsilon} \in \mathbb{R}^{n\times N_{x_s,\epsilon}}$ be an independent copy of $Y_{x_s,\epsilon}$. We then pass to the block-diagonal matrix $B_{ij} \in \mathbb{R}^{N_{X,\epsilon}\times N_{X,\epsilon}}$ defined in (143), whose $[sk, tl]$ entry, for $s = t$ and $k \ne l$, involves $e_i^*\big(P_{x_s,y_{sk}} - P_{x_s,y'_{sk}}\big)\nabla f(x_s)\nabla f(x_s)^*\big(P_{x_s,y_{sl}} - P_{x_s,y'_{sl}}\big)e_j$, and vanishes otherwise. Our next step is to decouple the sum in the last line of (141).

Decoupling. Let $\Xi = \{\xi_{sk}\}_{s,k}$ (with $s \in [1:N]$ and $k \in [1:N_{x_s,\epsilon}]$) be a sequence of independent standard Bernoulli random variables: each $\xi_{sk}$ independently takes the values one and zero with equal probabilities. We will shortly use the simple observation (144). We now revisit (141); since $B_{ij}[sk, sk] = 0$, combining (144) with Jensen's inequality bounds the moments by an expectation over $\Xi$. In particular, there must exist a realization $\Xi_0 = \{\xi_{0sk}\}_{s,k}$ that exceeds this expectation, leading to (146). Let $\{Y''_{X,\epsilon}, Y'''_{X,\epsilon}\}$ be an independent copy of $\{Y_{X,\epsilon}, Y'_{X,\epsilon}\}$; for the sake of brevity, we adopt the shorthand (147). Equipped with this construction, we revisit (146), add two zero-expectation terms, and use independence and Jensen's inequality to arrive at (148). There, we also defined the block-diagonal matrix $C_{ij} \in \mathbb{R}^{N_{X,\epsilon}\times N_{X,\epsilon}}$ in (149), whose $[sk, tl]$ entry, for $s = t$ and $k \ne l$, involves $e_i^*\big(P_{x_s,y_{sk}} - P_{x_s,y'_{sk}}\big)\nabla f(x_s)\nabla f(x_s)^*\big(P_{x_s,y''_{sl}} - P_{x_s,y'''_{sl}}\big)e_j$; note also (151). The next step is to modulate the sum in the last line of (148) with a Rademacher sequence.
Modulation with Rademacher sequences. Fix $i, j \in [1:n]$, and recall the definition of $C_{ij} \in \mathbb{R}^{N_{X,\epsilon}\times N_{X,\epsilon}}$ from (149). Let $H = \{\eta_{sk}\}_{s,k}$ (with $s \in [1:N]$ and $k \in [1:N_{x_s,\epsilon}]$) be a Rademacher sequence; that is, $\{\eta_{sk}\}_{s,k}$ are independent random variables taking $\pm 1$ with equal probabilities. Also let $H' = \{\eta'_{sk}\}_{s,k}$ be an independent copy of $H$. We then argue via a bound in which $\|A\|_\infty$ denotes the largest entry of $A$ in magnitude. With $e_i \in \mathbb{R}^n$ denoting the $i$th canonical vector, we continue and, using (38) and (22), eventually obtain a bound of the order
$$\sqrt{p} \cdot \frac{N\, Q_{X,\epsilon}^2\, L_f^2\, n^2}{N_{X,\min,\epsilon}} \cdot \frac{\rho_{\mu,X,\epsilon}}{N_{X,\epsilon}}. \qquad (167)$$
Conditioned on $\mathcal{N}_{X,\epsilon}$, the bound above is independent of $Y_{X,\epsilon}$, which allows us to remove the conditioning and find that
$$\mathbb{E}^p_{Y_{X,\epsilon},H,H'\,|\,E_3,\mathcal{N}_{X,\epsilon},E_1,E_2,X}\,[c_{ij}] \lesssim \sqrt{p} \cdot \frac{N\, Q_{X,\epsilon}^2\, L_f^2\, n^2}{N_{X,\min,\epsilon}} \cdot \frac{\rho_{\mu,X,\epsilon}}{N_{X,\epsilon}}.$$
As a useful aside, we also record a uniform bound on $c_{ij}$ for every $i, j \in [1:n]$.

K.1.4. Applying the union bound. In light of (172) and by applying the union bound to $\{a_{ij}\}_{i,j}$, we arrive at the following statement.