Adaptive random Fourier features with Metropolis sampling

The supervised learning problem to determine a neural network approximation $\mathbb{R}^d\ni x\mapsto\sum_{k=1}^K\hat\beta_k e^{{\mathrm{i}}\omega_k\cdot x}$ with one hidden layer is studied as a random Fourier features algorithm. The Fourier features, i.e., the frequencies $\omega_k\in\mathbb{R}^d$, are sampled using an adaptive Metropolis sampler. The Metropolis test accepts proposal frequencies $\omega_k'$, having corresponding amplitudes $\hat\beta_k'$, with the probability $\min\big\{1, (|\hat\beta_k'|/|\hat\beta_k|)^\gamma\big\}$, for a certain positive parameter $\gamma$, determined by minimizing the approximation error for given computational work. This adaptive, non-parametric stochastic method leads asymptotically, as $K\to\infty$, to equidistributed amplitudes $|\hat\beta_k|$, analogous to deterministic adaptive algorithms for differential equations. The equidistributed amplitudes are shown to asymptotically correspond to the optimal density for independent samples in random Fourier features methods. Numerical evidence is provided in order to demonstrate the approximation properties and efficiency of the proposed algorithm. The algorithm is tested both on synthetic data and a real-world high-dimensional benchmark.


Introduction
We consider a supervised learning problem for a data set $\{x_n, y_n\}_{n=1}^N \subset \mathbb{R}^d\times\mathbb{R}$, with the data independent identically distributed (i.i.d.) samples from an unknown probability distribution $\rho(\mathrm{d}x\,\mathrm{d}y)$. The distribution $\rho$ is not known a priori but it is accessible from samples of the data. We assume that there exists a function $f:\mathbb{R}^d\to\mathbb{R}$ such that $y_n=f(x_n)+\xi_n$, where the noise is represented by i.i.d. random variables $\xi_n$ with $\mathbb{E}[\xi_n]=0$ and $\mathbb{E}[\xi_n^2]=\sigma_\xi^2$. We assume that the target function $f(x)$ can be approximated by a single layer neural network, which defines an approximation $\beta:\mathbb{R}^d\times\mathbb{R}^{Kd}\times\mathbb{C}^K\to\mathbb{R}$,

(1) $\beta(x;\omega,\hat\beta)=\sum_{k=1}^K\hat\beta_k\, s(\omega_k,x)$,

where we use the notation $\hat\beta=(\hat\beta_1,\dots,\hat\beta_K)\in\mathbb{C}^K$ and $\omega=(\omega_1,\dots,\omega_K)\in\mathbb{R}^{Kd}$ for the parameters of the network. We consider a particular activation function, also known as Fourier features, $s(\omega,x)=e^{\mathrm{i}\omega\cdot x}$, for $\omega\in\mathbb{R}^d$, $x\in\mathbb{R}^d$.
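For concreteness, the network (1) can be evaluated directly; the following minimal NumPy sketch (with arbitrary illustrative frequencies and amplitudes, not taken from the paper) computes $\beta(x;\omega,\hat\beta)=\sum_k\hat\beta_k e^{\mathrm{i}\omega_k\cdot x}$:

```python
import numpy as np

def rff_network(x, omega, beta_hat):
    """Evaluate beta(x; omega, beta_hat) = sum_k beta_hat_k * exp(i omega_k . x).

    x:        (d,) input point
    omega:    (K, d) frequencies
    beta_hat: (K,) complex amplitudes
    """
    return np.sum(beta_hat * np.exp(1j * omega @ x))

# Toy example with K = 3 random features in d = 2 dimensions.
rng = np.random.default_rng(0)
omega = rng.standard_normal((3, 2))
beta_hat = rng.standard_normal(3) + 1j * rng.standard_normal(3)
x = np.array([0.5, -1.0])
print(rff_network(x, omega, beta_hat))
```

At $x=0$ the value reduces to $\sum_k\hat\beta_k$, a convenient sanity check.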
Since the distribution $\rho$ is not known, in practice the minimization problem is solved for the empirical risk $\frac{1}{N}\sum_{n=1}^N\ell\big(\beta(x_n;\hat\beta,\omega),y_n\big)$.
The network is said to be over-parametrized if the width $K$ is greater than the number $N$ of training points, i.e., $K>N$. We shall assume a fixed width such that $N>K$ when we study the dependence on the size of the training data sets. We focus on the reconstruction with the regularized least squares type risk function

$\ell\big(\beta(x_n;\hat\beta,\omega),y_n\big)=\big|y_n-\beta(x_n;\hat\beta,\omega)\big|^2+\lambda\sum_{k=1}^K|\hat\beta_k|^2$.
The least squares functional is augmented by the regularization term with a Tikhonov regularization parameter $\lambda\ge0$. For the sake of brevity we often omit the arguments $\hat\beta,\omega$ and use the notation $\beta(x)$ for $\beta(x;\hat\beta,\omega)$. We also use $|\hat\beta|^2:=\sum_{k=1}^K|\hat\beta_k|^2$ for the Euclidean norm on $\mathbb{C}^K$. To approximately reconstruct $f$ from the data based on the least squares method is a common task in statistics and machine learning, cf. [15], which in a basic setting takes the form of the minimization problem

(4) $\min_{(\hat\beta,\omega)\in\mathbb{C}^K\times\mathbb{R}^{Kd}}\Big\{\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]+\lambda|\hat\beta|^2\Big\}$,

where $\beta(\cdot;\hat\beta,\omega)$ represents an artificial neural network with one hidden layer. Suppose we assume that the frequencies $\omega$ are random, and denote by $\mathbb{E}_\omega[g(\omega,x,y)]:=\mathbb{E}[g(\omega,x,y)\,|\,x,y]$ the conditional expectation with respect to the distribution of $\omega$ conditioned on the data $(x,y)$. Since a minimum is always less than or equal to its mean, there holds

(6) $\min_{(\hat\beta,\omega)\in\mathbb{C}^K\times\mathbb{R}^{Kd}}\Big\{\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]+\lambda|\hat\beta|^2\Big\}\le\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]+\lambda|\hat\beta|^2\Big]$.

The minimization on the right-hand side of (6) is also known as the random Fourier features problem, see [10, 5, 14]. In order to obtain a better bound in (6) we assume that $\omega_k$, $k=1,\dots,K$, are i.i.d. random variables with a common probability distribution $p(\omega)\,\mathrm{d}\omega$ and introduce a further minimization

(7) $\min_p\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]+\lambda|\hat\beta|^2\Big]$.

An advantage of this splitting into two minimizations is that the inner optimization is a convex problem, so that several robust solution methods are available. The question is: how can the density $p$ in the outer minimization be determined?
The goal of this work is to formulate a systematic method to approximately sample from an optimal distribution $p^*$. The first step is to determine the optimal distribution. Following Barron's work [4] and [8], we first derive in Section 2 the known error estimate

(8) $\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]\Big]\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega$

based on independent samples $\omega_k$ from the distribution $p$. Then, as in importance sampling, it is shown that the right-hand side is minimized by choosing $p^*=|\hat f|/\|\hat f\|_{L^1(\mathbb{R}^d)}$, where $\hat f$ is the Fourier transform of $f$. Our next step is to formulate an adaptive method that approximately generates independent samples from the density $p^*$, thereby following the general convergence (8). We propose to use the Metropolis sampler:
• given frequencies $\omega=(\omega_1,\dots,\omega_K)\in\mathbb{R}^{Kd}$ with corresponding amplitudes $\hat\beta=(\hat\beta_1,\dots,\hat\beta_K)\in\mathbb{C}^K$, a proposal $\omega'\in\mathbb{R}^{Kd}$ is suggested and the corresponding amplitudes $\hat\beta'\in\mathbb{C}^K$ are determined by the minimum in (7); then
• the Metropolis test is, for each $k$, to accept $\omega_k'$ with probability $\min\big(1,|\hat\beta_k'|^\gamma/|\hat\beta_k|^\gamma\big)$.
The choice of the Metropolis criterion $\min(1,|\hat\beta_k'|^\gamma/|\hat\beta_k|^\gamma)$ and the selection of $\gamma$ are explained in Remark 3.2. This adaptive algorithm (Algorithm 1) is motivated mainly by two properties based on the regularized empirical measure $\hat\beta(\omega):=\sum_{k=1}^K\hat\beta_k\,\phi_\varepsilon(\omega-\omega_k)$ related to the amplitudes $\hat\beta$, where $\phi_\varepsilon(\omega)=(2\pi\varepsilon^2)^{-d/2}e^{-|\omega|^2/(2\varepsilon^2)}$:
(a) The quantities $Kp\hat\beta$ converge to $\hat f$ in $L^1$ asymptotically, as $K\to\infty$ and $\varepsilon\to0+$, as shown in Proposition 3.1. For the proof of Proposition 3.1 we consider a simplified setting where the support of the $x$-data is all of $\mathbb{R}^d$.
(b) Property (a) implies that the optimal density $p^*$ will asymptotically equidistribute the amplitudes, i.e., $|\hat\beta_k|$ becomes asymptotically constant in $k$ when sampling from $p^*$.
The proposed adaptive method aims to equidistribute the amplitudes $|\hat\beta_k|$: if $|\hat\beta_k|$ is large, more frequencies will be sampled Metropolis-wise in the neighborhood of $\omega_k$, and if $|\hat\beta_k|$ is small, fewer frequencies will be sampled in that neighborhood. Algorithm 1 includes the dramatic simplification of computing all amplitudes in one step for the proposed frequencies, so that the computationally costly step of solving the convex minimization problem for the amplitudes is not done for each individual Metropolis test. A reason that this simplification works is the asymptotic independence $Kp|\hat\beta|\to|\hat f|$ shown in Proposition 3.1. We note that the regularized amplitude measure $\hat\beta$ is impractical to compute in high dimension $d\gg1$. Therefore Algorithm 1 uses the amplitudes $\hat\beta_k$ instead, and consequently Proposition 3.1 serves only as a motivation that the algorithm can work.
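The adaptive loop described above can be condensed into the following illustrative Python sketch (our reading of the procedure, in $d=1$, with hypothetical parameter values): the amplitudes for a proposal are computed in one regularized least squares solve, and each $\omega_k'$ is accepted with probability $\min(1,|\hat\beta_k'|^\gamma/|\hat\beta_k|^\gamma)$.

```python
import numpy as np

def fit_amplitudes(omega, x, y, lam):
    """Regularized least squares for the amplitudes given frequencies (d = 1)."""
    S = np.exp(1j * x[:, None] * omega[None, :])
    N, K = S.shape
    return np.linalg.solve(S.conj().T @ S + lam * N * np.eye(K), S.conj().T @ y)

def adaptive_rff(x, y, K=16, M=100, delta=0.5, gamma=3.0, lam=1e-3, seed=1):
    """Sketch of the adaptive Metropolis sampling of frequencies."""
    rng = np.random.default_rng(seed)
    omega = np.zeros(K)
    beta = fit_amplitudes(omega, x, y, lam)
    for _ in range(M):
        omega_prop = omega + delta * rng.standard_normal(K)  # random walk proposal
        beta_prop = fit_amplitudes(omega_prop, x, y, lam)    # all amplitudes in one solve
        # componentwise Metropolis test: accept with prob min(1, |beta'_k|^g / |beta_k|^g)
        ratio = (np.abs(beta_prop) / np.maximum(np.abs(beta), 1e-300)) ** gamma
        accept = rng.random(K) < ratio
        omega[accept] = omega_prop[accept]
        beta = fit_amplitudes(omega, x, y, lam)
    return omega, beta

# Toy data: a narrow Gaussian target, no noise.
rng = np.random.default_rng(0)
xs = rng.standard_normal(200)
ys = np.exp(-8.0 * xs**2)
omega, beta = adaptive_rff(xs, ys)
pred = np.real(np.exp(1j * xs[:, None] * omega[None, :]) @ beta)
print("training MSE:", np.mean((pred - ys) ** 2))
```

The frequencies drift from the origin toward the support of $|\hat f|$, which for a narrow Gaussian target is wide, without any parametric model for the density.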
In some sense, the adaptive random features Metropolis method is a stochastic generalization of deterministic adaptive computational methods for differential equations where the optimal efficiency is obtained for equidistributed error indicators, pioneered in [2]. In the deterministic case, additional degrees of freedom are added where the error indicators are large, e.g., by subdividing finite elements or time steps. The random features Metropolis method analogously adds frequency samples where the indicators |β k | are large.
A common setting is, for a fixed number of data points N, to find the number of Fourier features K giving approximation errors similar to those of kernel ridge regression. Previous results on improving the sampling for random Fourier features in kernel learning are presented, e.g., in [3], [17], [9] and [1]. Our focus is somewhat different: for a fixed number of Fourier features K we seek an optimal method by adaptively adjusting the frequency sampling density to each data set. In [17] the Fourier features are adaptively sampled based on a density parametrized as a linear combination of Gaussians. The works [3] and [9] determine the optimal density as a leverage score for sampling random features, based on a singular value decomposition of an integral operator related to the reproducing kernel Hilbert space, and formulate a method to optimally resample given samples. Our adaptive random feature method, on the contrary, is not based on a parametric description or resampling, and we are not aware of other non-parametric adaptive methods generating samples for random Fourier features for general kernels. The work [1] studies how to optimally choose the number of Fourier features K for a given number of data points N and provides upper and lower error bounds. In addition, [1] presents a method to effectively sample from the leverage score in the case of Gaussian kernels.
We demonstrate computational benefits of the proposed adaptive algorithm by including a simple example that provides explicitly the computational complexity of the adaptive sampling Algorithm 1. Numerical benchmarks in Section 5 then further document gains in efficiency and accuracy in comparison with the standard random Fourier features that use a fixed distribution of frequencies.
Although our analysis is carried out for the specific activation function $s(\omega,x)=e^{\mathrm{i}\omega\cdot x}$, and is thus directly related to random Fourier features approximations, we note that in the numerical experiments (see Experiment 5 in Section 5) we also tested an activation function often used in the definition of neural networks, the sigmoid activation. With such a change of the activation function the concept of sampling frequencies turns into sampling weights. Numerical results in Section 5 suggest that Algorithm 1 performs well also in this case. A detailed study of a more general class of activation functions is the subject of ongoing work.
Theoretical motivations for the algorithm are given in Sections 2 and 3. In Section 3 we formulate and prove the weak convergence of the scaled amplitudes $K\hat\beta$. In Section 2 we derive the optimal density $p^*$ for sampling the frequencies, under the assumption that $\omega_k$, $k=1,\dots,K$, are independent and $\hat f\in L^1(\mathbb{R}^d)$. Section 4 describes the algorithms. Practical consequences of the theoretical results and numerical tests with different data sets are described in Section 5.

Optimal frequency distribution
2.1. Approximation rates using a Monte Carlo method. The purpose of this section is to derive a bound for

(9) $\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]\Big]$

and apply it to estimating the approximation rate for random Fourier features. The Fourier transform $\hat f$ yields the inverse representation

$f(x)=\int_{\mathbb{R}^d}\hat f(\omega)e^{\mathrm{i}\omega\cdot x}\,\mathrm{d}\omega=\int_{\mathbb{R}^d}\frac{\hat f(\omega)}{p(\omega)}e^{\mathrm{i}\omega\cdot x}\,p(\omega)\,\mathrm{d}\omega$.

We assume $\{\omega_1,\dots,\omega_K\}$ are independent samples from a probability density $p:\mathbb{R}^d\to[0,\infty)$. Then the Monte Carlo approximation of this representation yields the neural network approximation $f(x)\simeq\alpha(x,\omega)$ with the estimator defined by the empirical average

$\alpha(x,\omega)=\frac{1}{K}\sum_{k=1}^K\frac{\hat f(\omega_k)}{p(\omega_k)}e^{\mathrm{i}\omega_k\cdot x}$.

To assess the quality of this approximation we study the variance of the estimator $\alpha(x,\omega)$. By construction and the i.i.d. sampling of $\omega_k$, the estimator is unbiased, that is, $\mathbb{E}_\omega[\alpha(x,\omega)]=f(x)$.
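The unbiasedness can be checked numerically; in the sketch below we take $f(x)=e^{-x^2/2}$ in $d=1$, whose Fourier transform equals the standard normal density under the convention $f(x)=\int\hat f(\omega)e^{\mathrm{i}\omega x}\,\mathrm{d}\omega$, so that sampling $\omega_k\sim N(0,1)$ gives $\hat f(\omega_k)/p(\omega_k)=1$ (an illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha(x, omegas):
    # f(x) = exp(-x^2/2) has fhat(w) = exp(-w^2/2)/sqrt(2*pi), which equals
    # the sampling density p = N(0,1); hence fhat/p = 1 and the estimator
    # reduces to the empirical characteristic function of the samples.
    return np.mean(np.exp(1j * omegas * x))

x = 0.7
f_true = np.exp(-x**2 / 2)
K = 1000
# Average over independent realizations: the estimator is unbiased,
# so the averaged estimate approaches f(x).
est = np.mean([alpha(x, rng.standard_normal(K)) for _ in range(200)])
print(abs(est - f_true))
```

The deviation is of order $(200\,K)^{-1/2}$, consistent with the variance calculation that follows.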
Using this Monte Carlo approximation we obtain a bound on the error which reveals a rate of convergence with respect to the number of features K.
If there is no measurement error, i.e., $\sigma_\xi^2=0$ and $y_n=f(x_n)$, then

(13) $\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]\Big]\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega$.

Proof. Direct calculation shows that the variance of the Monte Carlo approximation satisfies

$\mathbb{E}_\omega\big[|\alpha(x,\omega)-f(x)|^2\big]=\frac{1}{K}\Big(\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega-|f(x)|^2\Big)\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega$,

and since a minimum is less than or equal to its average we obtain the random feature error estimate in the case without a measurement error, i.e., $\sigma_\xi^2=0$ and $y_n=f(x_n)$,

(14) $\mathbb{E}_\omega\Big[\min_{\hat\beta\in\mathbb{C}^K}\mathbb{E}_\rho\big[|y-\beta(x;\hat\beta,\omega)|^2\big]+\lambda|\hat\beta|^2\Big]\le\frac{1+\lambda}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega$.

Including the measurement error yields, after a straightforward calculation, an additional term proportional to the noise variance $\sigma_\xi^2$.

2.2.
Comments on the convergence rate and its complexity. The bounds (14) and (13) reveal the rate of convergence with respect to $K$. To demonstrate the computational complexity and the importance of using the adaptive sampling of frequencies, we fix the approximated function to be a simple Gaussian and consider the two cases $\sigma>\sqrt2$ and $0<\sigma\ll1$. Furthermore, we choose a particular distribution $p$ by assuming the frequencies $\omega_k$, $k=1,\dots,K$, are sampled from the standard normal distribution, $\omega_k\sim N(0,1)$ (i.e., the Gaussian density $p$ with mean zero and variance one).
Example I (large $\sigma$). In the first example we assume that $\sigma>\sqrt2$, so that the integral $\int_{\mathbb{R}^d}|\hat f(\omega)|^2/p(\omega)\,\mathrm{d}\omega$ is unbounded. The error estimate (14) therefore indicates no convergence. Algorithm 1, on the other hand, attains the optimal convergence rate for this example.
Example II (small $\sigma$). In the second example we choose $0<\sigma\ll1$, so that the constant $\int_{\mathbb{R}^d}|\hat f(\omega)|^2/p(\omega)\,\mathrm{d}\omega$ in the convergence rate $O(1/K)$ becomes of order $\sigma^{-d}$. The purpose of the adaptive random feature algorithm is to avoid the large factor $O(\sigma^{-d})$.
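The two regimes can be verified by numerical quadrature; the sketch below uses the one-dimensional Gaussian $f(x)=e^{-\sigma^2x^2/2}$ (our reading of the example, with the convention $f(x)=\int\hat f(\omega)e^{\mathrm{i}\omega x}\,\mathrm{d}\omega$) and the sampling density $p=N(0,1)$:

```python
import numpy as np

def variance_constant(sigma, w_max=50.0, n=400001):
    """Trapezoid-rule value of C(sigma) = int |fhat(w)|^2 / p(w) dw in d = 1 for
    f(x) = exp(-sigma^2 x^2 / 2), i.e. fhat(w) = exp(-w^2/(2 sigma^2)) / (sigma sqrt(2 pi)),
    with sampling density p = N(0, 1)."""
    w = np.linspace(-w_max, w_max, n)
    # |fhat|^2 / p written as one exponential to avoid 0/0 underflow in the tails
    integrand = np.exp(w**2 * (0.5 - 1.0 / sigma**2)) / (np.sqrt(2 * np.pi) * sigma**2)
    return np.sum((integrand[1:] + integrand[:-1]) / 2) * (w[1] - w[0])

for sigma in [0.05, 0.1, 0.2, 1.0]:
    print(sigma, variance_constant(sigma))
# As sigma -> 0 the constant grows like O(sigma^{-d}) = O(1/sigma) for d = 1,
# while for sigma >= sqrt(2) the integrand exp(w^2 (1/2 - 1/sigma^2)) is not integrable.
```

Halving $\sigma$ roughly doubles the constant, in agreement with the $O(\sigma^{-d})$ factor; for $\sigma=1$ the integrand is the standard normal density and the constant is exactly one.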
To have the loss function bounded by a given tolerance TOL therefore requires that the non-adaptive random feature method uses $K\ge\mathrm{TOL}^{-1}O(\sigma^{-d})$, and the computational work to solve the linear least squares problem is then of order $K^3=\mathrm{TOL}^{-3}O(\sigma^{-3d})$. In contrast, the proposed adaptive random features Metropolis method solves the least squares problem several times with a smaller $K=O(\mathrm{TOL}^{-1})$ to obtain the bound TOL for the loss. The number of Metropolis steps is asymptotically determined by the diffusion approximation in [11] and becomes proportional to $\gamma d\,\sigma^{-2}$. Therefore the computational work is the smaller $\mathrm{TOL}^{-3}O(\gamma d\,\sigma^{-2})$ for the adaptive method.

2.3.
Optimal Monte Carlo sampling. This section determines the optimal density p for independent Monte Carlo samples in (12) by minimizing, with respect to p, the right hand side in the variance estimate (12).
The density $p^*(\omega)=|\hat f(\omega)|/\|\hat f\|_{L^1(\mathbb{R}^d)}$ is the solution of the minimization problem

$\min_p\Big\{\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega\;:\;p\ge0,\ \int_{\mathbb{R}^d}p(\omega)\,\mathrm{d}\omega=1\Big\}$.

Define for any $v:\mathbb{R}^d\to\mathbb{R}$ with $\int_{\mathbb{R}^d}v(\omega)\,\mathrm{d}\omega=0$ and $\varepsilon$ close to zero the perturbed density $p+\varepsilon v$. At the optimum we have

$0=\frac{\mathrm{d}}{\mathrm{d}\varepsilon}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)+\varepsilon v(\omega)}\,\mathrm{d}\omega\Big|_{\varepsilon=0}=-\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)^2}\,v(\omega)\,\mathrm{d}\omega$

for all such $v$, so that $|\hat f|^2/p^2$ is constant. Consequently the optimal density becomes $p^*=|\hat f|/\|\hat f\|_{L^1(\mathbb{R}^d)}$. We note that the optimal density does not depend on the number of Fourier features, $K$, nor on the number of data points, $N$, in contrast to the optimal density for the least squares problem (28) derived in [3].
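The optimality of $p^*$ can also be checked numerically on a grid: by the Cauchy-Schwarz inequality, $\int|\hat f|^2/p\,\mathrm{d}\omega\ge\big(\int|\hat f|\,\mathrm{d}\omega\big)^2$ with equality exactly at $p=p^*$, so any perturbed density gives a larger variance constant (an illustration with a Gaussian $|\hat f|$, not from the paper):

```python
import numpy as np

# Grid check that p* = |fhat| / ||fhat||_{L1} minimizes int |fhat|^2 / p dw,
# using a Gaussian |fhat| as an illustrative example.
w = np.linspace(-10, 10, 2001)
dw = w[1] - w[0]
fhat_abs = np.exp(-w**2 / 2)

p_star = fhat_abs / (np.sum(fhat_abs) * dw)      # optimal density on the grid

def cost(p):
    """Discrete version of the variance constant int |fhat|^2 / p dw."""
    return np.sum(fhat_abs**2 / p) * dw

c_star = cost(p_star)                            # equals (int |fhat| dw)^2
rng = np.random.default_rng(0)
for _ in range(5):
    q = p_star * np.exp(0.3 * rng.standard_normal(w.size))
    q = q / (np.sum(q) * dw)                     # renormalize to a density
    print(cost(q) >= c_star)                     # True by Cauchy-Schwarz
```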
As mentioned at the beginning of this section sampling ω k from the distribution p * (ω)dω leads to the tight upper bound on the approximation error in (14).

Asymptotic behavior of the amplitudes $\hat\beta_k$
The optimal density $p^*=|\hat f|/\|\hat f\|_{L^1(\mathbb{R}^d)}$ can be related to data as follows: considering the problem (9) and letting $\zeta(x)=\sum_{k=1}^K\hat\zeta_k e^{\mathrm{i}\omega_k\cdot x}$ be a least squares minimizer of (18), the vanishing gradient at a minimum yields the normal equations, for $\ell=1,\dots,K$. Thus, if the $x_n$-data points are distributed according to a distribution with a density $\rho$, the normal equations can be written in Fourier space as (20). Given a sequence of samples $\{\omega_k\}_{k=1}^\infty$ drawn independently from a density $p$, we impose the assumptions (A1)-(A4): roughly, a uniform bound on the amplitudes (A1), a vanishing effect of the regularization, $\lambda\hat\zeta_\ell\to0$ as $K\to\infty$ (A2), compact support of $p$ (A3), and denseness of the samples in the support of $p$ (A4). We note that (A4) almost follows from (A3), since (A3) implies that the density $p$ has bounded first moment; hence the law of large numbers implies that with probability one the sequence $\{\omega_k\}_{k=1}^\infty$ is dense in the support of $p$. In order to treat the limiting behaviour of the amplitudes as $K\to\infty$ we introduce the empirical measure $\hat Z_K$, so that the normal equations (20) take the form (24). By the assumption (A1) the empirical measures are uniformly bounded in the total variation norm. We note that by (A3) the measures $\hat Z_K$ in $\mathbb{R}^d$ have their support in $U$. We obtain the weak convergence result stated as the following proposition.
Proof. To simplify the presentation we introduce the regularized empirical measures $\hat\chi_{K,\varepsilon}:=\hat Z_K*\phi_\varepsilon$. The proof consists of three steps: 1. compactness yields an $L^1$-convergent subsequence of the regularized empirical measures $\{\hat\chi_{K_j,\varepsilon}\}_{j=1}^\infty$; 2. the normal equation (20) implies a related equation for the subsequence limit; and 3. a subsequence of the empirical measures converges weakly, and as $\varepsilon\to0+$ the limit normal equation establishes (25).
Step 1. As $\phi_\varepsilon$ are standard mollifiers, we have, for a fixed $\varepsilon>0$, that the functions $\hat\chi_{K,\varepsilon}$ are smooth (we omit $\varepsilon$ in the notation and write $\hat\chi_K$). By compactness, see [6], there is an $L^1(V)$-convergent subsequence of functions $\{\hat\chi_K\}$, i.e., $\hat\chi_K\to\hat\chi$ in $L^1(V)$ as $K\to\infty$. Since, as a consequence of the assumption (A3), we have $\operatorname{supp}\hat Z_K\subset U$ for all $\hat Z_K$, and hence $\operatorname{supp}\hat\chi_K\subset V$ for all $\hat\chi_K$, the limit $\hat\chi$ has its support in $V$. Hence $\hat\chi$ can be extended by zero to $\mathbb{R}^d\setminus V$. Thus we obtain the convergence (26).

Step 2. The normal equations (24) can be written as a perturbation of the convergence (26). We rewrite the term involving $\hat\rho$, now considering a general point $\omega$ instead of $\omega_\ell$ and the change of measure from $\hat Z_K$ to $\hat\chi_K$, and apply Taylor's theorem, using that by assumption the set $\{\omega_k\}_{k=1}^\infty$ is dense in the support of $p$. Since $\lambda\hat\zeta_\ell\to0$ as $K\to\infty$ by assumption (A2), the normal equation (24) implies that the limit is determined by the limit equation (27). We have here used that the function $\hat f\hat\rho$ is continuous, as $\hat\rho\in C^1$, and the denseness of the sequence $\{\omega_k\}_{k=1}^\infty$.
Step 3. From the assumption (A1), all $\hat Z_K$ are uniformly bounded in the total variation norm and supported on a compact set; therefore there is a weakly converging subsequence $\hat Z_K\rightharpoonup\hat Z$, i.e., $\int g\,\mathrm{d}\hat Z_K\to\int g\,\mathrm{d}\hat Z$ for all $g\in C^1(V)$. This subsequence of $\hat Z_K$ can be chosen as a subsequence of the converging sequence $\hat\chi_K$. Letting $\varepsilon\to0+$ in $\hat Z*\phi_\varepsilon$ we obtain by (27) the limit relation, and we conclude, by the inverse Fourier transform, that the claim holds for $f\in C^0(\mathbb{R}^d)$ and $\rho$ in the Schwartz class. If the support of $\rho$ is $\mathbb{R}^d$ we obtain that $\hat Z=\hat f\in L^1(\mathbb{R}^d)$.
The approximation in Proposition 3.1 is in the sense of the limit of a large data set, $N\to\infty$, which implies $\hat\beta=\hat\zeta_K$. Then, by the result of the proposition applied to the regularized empirical measure for $\hat\zeta_K$, the quantity $Kp(\omega_k)\hat\beta_k$ converges weakly to $\hat f(\omega_k)$ as $K\to\infty$, and we have $|\hat f(\omega_k)|=p^*(\omega_k)\|\hat f\|_{L^1(\mathbb{R}^d)}$. We remark that this argument gives a heuristic justification for the proposed adaptive algorithm; in particular, it explains the idea behind the choice of the likelihood ratio in the Metropolis accept-reject criterion.
Remark 3.2. By Proposition 3.1, $Kp(\omega_k)\hat\beta_k$ converges weakly to $\hat f(\omega_k)$ as $K\to\infty$. If it also converged strongly, the asymptotic sampling density for $\omega$ in the random feature Metropolis method would satisfy $p=(C|\hat f|)^\gamma/p^\gamma$, which has the fixed point solution $p=(C|\hat f|)^{\gamma/(\gamma+1)}$. As $\gamma\to\infty$ this density $p$ approaches the optimal $|\hat f|/\|\hat f\|_{L^1(\mathbb{R}^d)}$. On the other hand the computational work increases with larger $\gamma$; in particular, the number of Metropolis steps is asymptotically determined by the diffusion approximation in [11] and becomes inversely proportional to the variance of the target density, which now depends on $\gamma$. If the target density is chosen with positive constants $\lambda_1$ and $\lambda_2$, where $\lambda_2>\|\hat fp\|_{L^\infty(\mathbb{R}^d)}$, this additional penalty yields a least squares problem with the same accuracy as (14), since for the optimal solution the penalty vanishes.
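The fixed point quoted in the remark follows from a one-line computation (a sketch of the heuristic, under the strong-convergence assumption stated above):

```latex
% Stationarity heuristic for the Metropolis chain in Remark 3.2:
% if K p(\omega)\hat\beta(\omega) \to \hat f(\omega), the amplitudes scale as
% |\hat\beta| \sim C |\hat f| / p, so the acceptance ratio balances when
p(\omega) \;=\; \Big(\frac{C\,|\hat f(\omega)|}{p(\omega)}\Big)^{\gamma}
\quad\Longleftrightarrow\quad
p(\omega)^{\gamma+1} \;=\; \big(C\,|\hat f(\omega)|\big)^{\gamma}
\quad\Longleftrightarrow\quad
p(\omega) \;=\; \big(C\,|\hat f(\omega)|\big)^{\frac{\gamma}{\gamma+1}},
% which tends to C |\hat f(\omega)|, i.e., the optimal density up to
% normalization, as \gamma \to \infty.
```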
The other assumption (A2), which is used to obtain $\lambda|\hat\zeta_\ell|\to0$ as $K\to\infty$, can be removed by letting $\lambda$ tend to zero slowly as $K\to\infty$, since then by (A1) we obtain $\lambda|\hat\zeta_\ell|\to0$.

Description of algorithms
In this section we formulate the adaptive random features Algorithm 1, and its extension, Algorithm 2, which adaptively updates the covariance matrix when sampling frequencies ω. Both algorithms are tested on different data sets and the tests are described in Section 5.
Before running Algorithm 1 or 2 we normalize all training data to have mean zero and component wise standard deviation one. The normalization procedure is described in Algorithm 3.
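A minimal sketch of the normalization (using the training statistics for both sets; how the $y$-data are treated in Algorithm 3 is not shown in this excerpt, so only the $x$-data are handled here):

```python
import numpy as np

def normalize(x_train, x_test):
    """Shift and scale to mean zero and componentwise standard deviation one,
    using the training set's empirical statistics for both sets
    (a sketch of the role of Algorithm 3)."""
    mu, sd = x_train.mean(axis=0), x_train.std(axis=0)
    return (x_train - mu) / sd, (x_test - mu) / sd

rng = np.random.default_rng(0)
x_train = 3.0 + 2.0 * rng.standard_normal((500, 2))
x_test = 3.0 + 2.0 * rng.standard_normal((100, 2))
xn_train, xn_test = normalize(x_train, x_test)
print(xn_train.mean(axis=0), xn_train.std(axis=0))
```

Note that the test data are shifted and scaled with the training statistics, matching the description of the test data normalization in Section 5.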
A discrete version of problem (4) can be formulated, for training data $\{(x_n,y_n)\}_{n=1}^N$, as the standard least squares problem

(28) $\min_{\hat\beta\in\mathbb{C}^K}\Big\{\frac{1}{N}\big|S\hat\beta-y\big|^2+\lambda|\hat\beta|^2\Big\}$,

where $S\in\mathbb{C}^{N\times K}$ is the matrix with elements $S_{n,k}=e^{\mathrm{i}\omega_k\cdot x_n}$, $n=1,\dots,N$, $k=1,\dots,K$, and $y=(y_1,\dots,y_N)\in\mathbb{R}^N$. Problem (28) has the corresponding linear normal equations

(29) $(S^*S+\lambda NI)\hat\beta=S^*y$,

where $S^*$ denotes the conjugate transpose, which can be solved, e.g., by singular value decomposition if $N\sim K$ or by the stochastic gradient method if $N\gg K$, cf. [16] and [15]. Other alternatives when $N\gg K$ are to sample the data points using information about the kernel in the random features, see e.g. [3]. Here we do not focus on the interesting and important question of how to optimally sample data points; instead we focus on how to sample random features, i.e., the frequencies $\omega_k$. In the random walk Metropolis proposal step in Algorithm 1 the vector $r$ is a sample from the standard multivariate normal distribution. The choice of distribution to sample $r$ from is somewhat arbitrary and not always optimal. Consider for example a target distribution in two dimensions that has elliptically shaped level surfaces. From a computational complexity point of view one would like to take different step lengths in different directions.
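The normal equations (29) can be solved directly; the sketch below recovers the amplitudes of a known conjugate-symmetric signal (an illustration with made-up frequencies; the conjugate transpose is used since $S$ is complex):

```python
import numpy as np

def fit_rff_amplitudes(x, y, omega, lam):
    """Solve (1/N)|S beta - y|^2 + lam |beta|^2 via the normal equations
    (S* S + lam N I) beta = S* y, with S_{n,k} = exp(i omega_k . x_n);
    S* is the conjugate transpose since S is complex."""
    S = np.exp(1j * x @ omega.T)                 # x: (N, d), omega: (K, d)
    N, K = S.shape
    return np.linalg.solve(S.conj().T @ S + lam * N * np.eye(K), S.conj().T @ y)

# Sanity check with made-up frequencies: the real, conjugate-symmetric signal
# y = 0.5 cos(x) + 2 cos(5x) is recovered (up to the tiny regularization).
rng = np.random.default_rng(0)
x = rng.standard_normal((400, 1))
omega = np.array([[-5.0], [-1.0], [1.0], [5.0]])
beta_true = np.array([1.0, 0.25, 0.25, 1.0], dtype=complex)
y = np.real(np.exp(1j * x @ omega.T) @ beta_true)
beta = fit_rff_amplitudes(x, y, omega, lam=1e-10)
print(np.round(beta.real, 6))
```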
The computational complexity reasoning leads us to consider sampling $r$ from a multivariate normal distribution with a covariance matrix $C_t$ adaptively updated during the $M$ iterations. The general idea of adaptively updating a covariance matrix during Metropolis iterations is not novel: a recursive algorithm for adaptively updating the covariance matrix is proposed and analysed in [7], and further test results are presented in [13].
In Algorithm 1 we choose the initial step length value $\delta=2.4^2/d$, which is motivated for general Metropolis sampling in [12]. When running Algorithm 1 the value of $\delta$ can be adjusted as a hyperparameter depending on the data.
In Algorithm 2 there is the hyperparameter ω max which defines the maximum radius of frequencies ω that can be sampled by Algorithm 2. In some problems the sampling of frequencies will start to diverge unless ω max is finite. This will typically happen if the frequency distribution is slowly decaying as |ω| → ∞.
In the convergence proof of [7] the probability density function of the distribution to be sampled from is required to have compact support. In practice, though, when applying Algorithm 2 we notice that a normal distribution approximates compactness sufficiently well not to require $\omega_{\max}$ to be finite. Another approach is to let $\omega_{\max}$ be infinity and adjust the hyperparameters $\delta$ and $M$ so that the iterations do not diverge from the minima. All hyperparameters are adjusted to minimize the error computed on a validation set disjoint from the training and test data sets.
With adaptive covariance combined into Algorithm 1 we sample $r$ from $N(0,\bar C)$, where initially $\bar C$ is the identity matrix in $\mathbb{R}^{d\times d}$. After each iteration $i=1,2,\dots,M$ the covariance $\bar C'$ of all previous frequencies $\omega_k^j$, $k=1,2,\dots,K$, $j<i$, is computed. After $t_0$ iterations we update $\bar C$ with the value of $\bar C'$. We present adaptive covariance applied to Algorithm 1 as Algorithm 2.
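The covariance update can be sketched as follows (illustrative frequency history with elliptic spread; the Metropolis accept-reject dynamics are omitted, and the scaling of the proposal by the step length is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

# Illustrative frequency history with elliptic spread, standing in for the
# omega samples accumulated during the Metropolis iterations.
freqs = rng.multivariate_normal(np.zeros(d), np.diag([0.1**2, 10.0**2]), size=5000)

C_bar = np.cov(freqs, rowvar=False)      # the adaptive covariance estimate C'
delta = 2.4**2 / d                       # initial step length suggested in [12]
# Proposal increments r ~ N(0, C_bar), scaled by the step length
# (the exact scaling convention is an assumption in this sketch).
r = np.sqrt(delta) * rng.multivariate_normal(np.zeros(d), C_bar, size=64)
print(np.round(C_bar, 3))
```

The estimated covariance picks up the anisotropy of the sampled frequencies, so subsequent proposals automatically take longer steps in the wide direction.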

Numerical tests
We demonstrate different capabilities of the proposed algorithms with three numerical case studies. The first two cases are regression problems and the third case is a classification problem. For the regression problems we also show comparisons with the stochastic gradient method.
The motivation for the first case is to compare the results of the algorithms to the estimate (14) based on the optimal density $p^*$. Both Algorithm 1 and Algorithm 2 approximately sample the optimal distribution, while, for example, a standard random Fourier features approach with $p\sim N(0,1)$ does not.
Another benefit of Algorithm 1, and especially Algorithm 2, is the efficiency in the sense of computational complexity. The purpose of the second case is to study the development of the generalization error over actual time in comparison with a standard method, which in this case is an implementation of the stochastic gradient method. The problem is in two dimensions, which imposes more difficulty in finding the optimal distribution compared to a problem in one dimension.

Algorithm 2 Adaptive random Fourier features with Metropolis sampling and adaptive covariance
Input: $\{(x_n,y_n)\}_{n=1}^N$ {data}
Output: $x\mapsto\sum_{k=1}^K\hat\beta_k e^{\mathrm{i}\omega_k\cdot x}$ {random features}
Choose a sampling time $T$, a proposal step length $\delta$, an exponent $\gamma$ (see Remark 3.2), a Tikhonov parameter $\lambda$, a burn-in time $t_0$ for the adaptive covariance, a maximum frequency radius $\omega_{\max}$ and a number $\check N$ of $\hat\beta$ updates
$M\leftarrow$ integer part of $T/\delta^2$
$\omega\leftarrow$ the zero vector in $\mathbb{R}^{Kd}$
$\hat\beta\leftarrow$ minimizer of the problem (28) given $\omega$

Algorithm 3 Normalization of data
In addition to the regression problems in the first two cases we present a classification problem in the third case: the classification of handwritten digits, with labels, found in the MNIST database. For training the neural network we use Algorithm 1 and compare with naive random Fourier features. The purpose of the third case is to demonstrate the ability of Algorithm 1 to handle non-synthetic data. In the simulations for the first case we perform five experiments:
• Experiment 1: The distribution of the frequencies $\omega\in\mathbb{R}^{Kd}$ is obtained adaptively by Algorithm 1.
• Experiment 2: The distribution of the frequencies ω ∈ R Kd is obtained adaptively by Algorithm 2.
• Experiment 3: The distribution of the frequencies $\omega\in\mathbb{R}^{Kd}$ is fixed and the independent components $\omega_k$ are sampled from a normal distribution.
• Experiment 4: Both the frequencies $\omega\in\mathbb{R}^{Kd}$ and the amplitudes $\hat\beta\in\mathbb{C}^K$ are trained by the stochastic gradient method.
• Experiment 5: The weight distribution $\omega\in\mathbb{R}^{Kd}$ is obtained adaptively by Algorithm 1, but using the sigmoid activation function.
For the second case we perform Experiments 1-4, and in the simulations for the third case we perform Experiment 1 and Experiment 3. All chosen parameter values are presented in Table 2.
We denote by $S_{\mathrm{test}}\in\mathbb{C}^{\tilde N\times K}$ the matrix with elements $e^{\mathrm{i}\omega_k\cdot\tilde x_n}$. The test data $\{(\tilde x_n,\tilde y_n)\,|\,n=1,\dots,\tilde N\}$ are i.i.d. samples from the same probability distribution and normalized by the same empirical mean and standard deviation as the training data $\{(x_n,y_n)\,|\,n=1,\dots,N\}$. In the computational experiments we compute the generalization error as $\frac{1}{\tilde N}\sum_{n=1}^{\tilde N}\big|\tilde y_n-\beta(\tilde x_n)\big|^2$. We denote by $\sigma_K$ the empirical standard deviation of the generalization error, based on $\bar M=10$ independent realizations for each fixed $K$, and let an error bar be the closed interval centered at the empirical mean with half-width $\sigma_K$.

The purpose of Experiment 5 is to demonstrate the possibility of changing the activation function $x\mapsto e^{\mathrm{i}\omega\cdot x}$ to the sigmoid activation function $x\mapsto\frac{1}{1+e^{-\omega\cdot x}}$ when running Algorithm 1. With such a change of activation function the concept of sampling frequencies turns into sampling weights. In practice we add one dimension to each $x$-point to add a bias, compensating for using a real valued activation function, and we set the value of the additional component to one. Moreover, we change $S_{n,k}=e^{\mathrm{i}\omega_k\cdot x_n}$ in (28) to $S_{n,k}=\frac{1}{1+e^{-\omega_k\cdot x_n}}$.

Case 1: Target function with a regularized discontinuity. This case tests the capability of Algorithm 1 and Algorithm 2 to approximately find and sample frequencies $\omega$ from the optimal distribution $p^*=|\hat f|/\|\hat f\|_{L^1(\mathbb{R})}$. The target function $f(x)=\mathrm{Si}(x/a)$, where $a=10^{-3}$ and $\mathrm{Si}(x)=\int_0^x\frac{\sin t}{t}\,\mathrm{d}t$ is the so-called sine integral, has a Fourier transform that decays slowly, as $|\omega|^{-1}$, up to $|\omega|=1/a=1000$. The target function $f$ is plotted in Figure 1, together with $f$ evaluated in $N$ $x$-points from a standard normal distribution, over an interval chosen to emphasize that $N$ points are enough to resolve the local oscillations near $x=0$.
The Fourier transform of $f$ is approximated by computing the fast Fourier transform of $f$ evaluated in $2N$ equidistributed $x$-points in the interval $[-2\pi,2\pi]$. The inset in Figure 1 presents the absolute value of the fast Fourier transform of $f$, which drops to zero at approximately $|\omega|=1/a=10^3$.
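The construction of the Case 1 target can be reproduced numerically; the sketch below evaluates the sine integral by a cumulative trapezoid rule and takes $f(x)=\mathrm{Si}(x/a)$, which is our reading of the elided formula:

```python
import numpy as np

a = 1e-3

# Sine integral Si(t) = int_0^t sin(s)/s ds via a cumulative trapezoid rule.
t = np.linspace(0.0, 2 * np.pi / a, 200001)
g = np.ones_like(t)                      # sin(s)/s -> 1 as s -> 0
g[1:] = np.sin(t[1:]) / t[1:]
si = np.concatenate([[0.0], np.cumsum((g[1:] + g[:-1]) / 2 * np.diff(t))])

def f(x):
    """Target f(x) = Si(x/a); Si is odd, so evaluate at |x/a| and restore the sign."""
    u = np.asarray(x) / a
    return np.sign(u) * np.interp(np.abs(u), t, si)

# FFT of f on [-2pi, 2pi): |fhat| decays slowly up to |omega| = 1/a.
M = 2**16
x = np.linspace(-2 * np.pi, 2 * np.pi, M, endpoint=False)
fhat = np.fft.rfft(f(x)) / M
omega = 2 * np.pi * np.fft.rfftfreq(M, d=x[1] - x[0])
i, j = np.argmin(np.abs(omega - 100.0)), np.argmin(np.abs(omega - 200.0))
print("Si(2*pi/a) ~ pi/2:", si[-1])
print("|fhat| near omega = 100 and 200:", abs(fhat[i]), abs(fhat[j]))
```

Note that $\mathrm{Si}(t)\to\pi/2$ as $t\to\infty$, so $f$ is a regularized jump of height $\pi$ at $x=0$.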
We generate training data and test data as follows. First sample $N$ $x$-points from $N(0,1)$. Then evaluate the target function in each $x$-point to get the $y$-points, and run Algorithm 3 on the generated points to get the normalized training data $\{x_n,y_n\}_{n=1}^N$ and, analogously, the normalized test data $\{\tilde x_n,\tilde y_n\}_{n=1}^{\tilde N}$. We run Experiments 1-5 on the generated data for different values of $K$ and present the resulting generalization error dependence on $K$, with error bars, in Figure 2. The triangles pointing to the left represent generalization errors produced from a neural network trained by Algorithm 1, and the diamonds by Algorithm 2. The stars correspond to the stochastic gradient descent with initial frequencies from $N(0,50^2)$, while the circles also correspond to the stochastic gradient descent but with initial frequencies from $N(0,1)$. The squares correspond to the standard random Fourier features approach, sampling frequencies $\omega_k$ from $N(0,1)$. The triangles pointing down represent Algorithm 1 but for the sigmoid activation function. Algorithm 1 shows a constant slope with respect to $K$. Although the generalization error becomes smaller for the stochastic gradient descent as the variance of the initial frequencies increases to $50^2$ from $1$, it stagnates as $K$ increases. For a given $K$ one could fine tune the initial frequency distribution for the stochastic gradient descent, but for Algorithm 1 and Algorithm 2 no such tuning is needed. The specific parameter choices for each experiment are presented in Table 2.
Case 2: A high dimensional target function. The purpose of this case is to test the ability of Algorithm 1 to train a neural network in a higher dimension; we therefore set $d=5$. The data is generated analogously to Case 1, but we now use a target function $f:\mathbb{R}^5\to\mathbb{R}$ with $a=10^{-1}$. We run Experiments 1, 3 and 4, and the resulting convergence plot with respect to the number of frequencies $K$ is presented in Figure 3.
In Figure 3 we can see the expected convergence rate of $O(K^{-1/2})$ for Algorithm 1. The stochastic gradient method gives an error that converges as $O(K^{-1/2})$ for smaller values of $K$ but stops improving for approximately $K>128$ for the chosen number of iterations.

Case 3: A two-dimensional target function with elliptic level surfaces. The target function, as well as $\hat f$, has elliptically shaped level surfaces. To find the optimal distribution $p^*$, which is $N(0,\mathrm{diag}(32^{-2},32^{2}))$, thus requires finding the covariance matrix $\mathrm{diag}(32^{-2},32^{2})$.
The generation of data is done as in Case 1, except that the non-normalized $x$-points are independent random vectors in $\mathbb{R}^2$ with independent $N(0,1)$ components. We fix the number of nodes in the neural network to $K=256$ and compute an approximate solution to the problem (29) by running Experiments 1, 2 and 4.
Convergence of the generalization error with respect to time is presented in Figure 4, where we note that both Algorithm 1 and Algorithm 2 produce faster convergence than the stochastic gradient descent. For the stochastic gradient method the learning rate has been tuned to the benefit of the convergence rate of the generalization error, while the initial distribution of the components of $\omega$ is simply chosen as the standard normal distribution.

Case 4:
The MNIST data set. The previously presented numerical tests have dealt with problems of a pure regression character. In contrast, we now turn our focus to a classification problem of handwritten digits.
The MNIST data set consists of a training set of 60000 handwritten digits with corresponding labels and a test set of 10000 handwritten digits with corresponding labels.
As a comparison we also use frequencies from the standard normal distribution and from the normal distribution $N(0,0.1^2)$, but otherwise solve the problem in the same way as described in this case.
The error is computed as the percentage of misclassified digits over the test set $\{(\tilde x_n;(\tilde y_n^0,\tilde y_n^1,\dots,\tilde y_n^9))\}_{n=1}^{\tilde N}$. We present the results in Table 1 and Figure 5, where we note that the smallest error is achieved when frequencies are sampled by using Algorithm 1. When sampling frequencies from the standard normal distribution, i.e., $N(0,1)$, we do not observe any convergence with respect to $K$. Figure 5 compares the adaptively computed distribution of frequencies $\omega_k$ with sampling from a fixed (normal) distribution.

5.1. Optimal distribution $p^*$. In this numerical test we demonstrate how the average generalization error depends on the distribution $p$ for a particular choice of data distribution. We recall the frequencies $\omega=(\omega_1,\dots,\omega_K)$ with $\omega_k$ i.i.d. from the distribution $p$. In the experiment the data $\{(x_n,y_n)\}_{n=1}^N$ are given by $y_n=e^{-|x_n|^2/2}+\epsilon_n$. We let the components $\omega_k$ be sampled independently from $N(0,\sigma_\omega^2)$ and monitor the results for different values of $\sigma_\omega$. Note that Algorithm 1 would approximately sample $\omega$ from the optimal density $p^*$, which in this case corresponds to the standard normal, i.e., $\omega_k\sim N(0,1)$, see (16).
The results are depicted in Figure 6, where we observe that the generalization error is minimized for $\sigma_\omega\approx1$, in agreement with the theoretical optimum.
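The experiment in 5.1 can be imitated in a few lines (a simplified $d=1$ sketch with hypothetical sizes $K$, $N$ and noise level, not the paper's exact setup):

```python
import numpy as np

def gen_error(sigma_w, K=32, N=2000, lam=1e-6, seed=0):
    """Generalization error with frequencies drawn from N(0, sigma_w^2) for the
    target y = exp(-|x|^2/2) + noise, here in d = 1 (hypothetical sizes K, N
    and noise level; a simplified sketch of the test in 5.1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    y = np.exp(-x**2 / 2) + 0.05 * rng.standard_normal(N)
    x_test = rng.standard_normal(N)
    y_test = np.exp(-x_test**2 / 2)
    omega = sigma_w * rng.standard_normal(K)
    # regularized least squares for the amplitudes, as in (29)
    S = np.exp(1j * np.outer(x, omega))
    beta = np.linalg.solve(S.conj().T @ S + lam * N * np.eye(K), S.conj().T @ y)
    pred = np.real(np.exp(1j * np.outer(x_test, omega)) @ beta)
    return np.mean((pred - y_test) ** 2)

for sw in [0.1, 1.0, 4.0]:
    print(sw, gen_error(sw))
# the error is typically smallest near sigma_w = 1, i.e., near p* = N(0, 1)
```

Too narrow a frequency distribution cannot represent the spectral content of the target at all, which is why the error deteriorates much faster for small $\sigma_\omega$ than for large.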