# American Institute of Mathematical Sciences

2020, 2(3): 309-332. doi: 10.3934/fods.2020014

## Adaptive random Fourier features with Metropolis sampling

Aku Kammonen, Jonas Kiessling, Petr Plecháč, Mattias Sandberg, Anders Szepessy

1. Department of Mathematics, Royal Institute of Technology, 100 44 Stockholm, Sweden
2. H-Ai AB, Grevgatan 23, 114 53 Stockholm, Sweden
3. Department of Mathematical Sciences, University of Delaware, Newark, DE 19716, USA

* Corresponding author: Aku Kammonen

Received: July 2020. Revised: August 2020. Early access: September 2020.

Funding: The research was supported by Swedish Research Council grant 2019-03725. The work of P. P. was supported in part by ARO grant W911NF-19-1-0243.

The supervised learning problem to determine a neural network approximation $\mathbb{R}^d\ni x\mapsto\sum_{k = 1}^K\hat\beta_k e^{{{\mathrm{i}}}\omega_k\cdot x}$ with one hidden layer is studied as a random Fourier features algorithm. The Fourier features, i.e., the frequencies $\omega_k\in\mathbb{R}^d$, are sampled using an adaptive Metropolis sampler. The Metropolis test accepts proposal frequencies $\omega_k'$, having corresponding amplitudes $\hat\beta_k'$, with the probability $\min\big\{1, (|\hat\beta_k'|/|\hat\beta_k|)^{\gamma}\big\}$, for a positive parameter $\gamma$ determined by minimizing the approximation error for a given computational work. This adaptive, non-parametric stochastic method leads asymptotically, as $K\to\infty$, to equidistributed amplitudes $|\hat\beta_k|$, analogous to deterministic adaptive algorithms for differential equations. The equidistributed amplitudes are shown to correspond asymptotically to the optimal density for independent samples in random Fourier features methods. Numerical evidence demonstrates the approximation properties and efficiency of the proposed algorithm, which is tested on both synthetic data and a real-world high-dimensional benchmark.
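To make the scheme concrete, the following is a minimal Python sketch of the adaptive Metropolis loop described in the abstract. It is an illustration under stated assumptions, not the authors' reference implementation: the regularized normal-equations solve for the amplitudes, the Gaussian initialization of the frequencies, and all function names are ours. Only the model $\sum_k\hat\beta_k e^{\mathrm{i}\omega_k\cdot x}$ and the acceptance probability $\min\{1,(|\hat\beta_k'|/|\hat\beta_k|)^{\gamma}\}$ are taken from the paper.

```python
import numpy as np

def fit_amplitudes(omega, x, y, lam):
    """Least-squares amplitudes beta for f(x) ~ sum_k beta_k exp(i omega_k . x),
    solved via normal equations with Tikhonov parameter lam."""
    S = np.exp(1j * x @ omega.T)                       # feature matrix, shape (N, K)
    K = omega.shape[0]
    A = S.conj().T @ S / x.shape[0] + lam * np.eye(K)
    b = S.conj().T @ y / x.shape[0]
    return np.linalg.solve(A, b)

def adaptive_rff_metropolis(x, y, K, M, delta, gamma, lam, rng):
    """Adaptive random Fourier features: random-walk proposals omega_k' are
    accepted with probability min(1, (|beta_k'| / |beta_k|)**gamma)."""
    d = x.shape[1]
    omega = rng.standard_normal((K, d))                # illustrative initialization
    beta = fit_amplitudes(omega, x, y, lam)
    for _ in range(M):
        omega_prop = omega + delta * rng.standard_normal((K, d))
        beta_prop = fit_amplitudes(omega_prop, x, y, lam)
        # per-frequency Metropolis test on the amplitude ratio
        accept = rng.random(K) < np.abs(beta_prop / beta) ** gamma
        omega[accept] = omega_prop[accept]
        beta = fit_amplitudes(omega, x, y, lam)        # refit on updated frequencies
    return omega, beta

# Example call with the parameter scalings used in the paper's experiments
# (gamma = 3d - 2, delta = 2.4^2/d, lambda = 0.1):
# omega, beta = adaptive_rff_metropolis(x, y, K=256, M=1000, delta=2.4**2 / d,
#                                       gamma=3 * d - 2, lam=0.1,
#                                       rng=np.random.default_rng(0))
```

For real-valued targets, predictions are recovered as `(np.exp(1j * x_new @ omega.T) @ beta).real`.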

Citation: Aku Kammonen, Jonas Kiessling, Petr Plecháč, Mattias Sandberg, Anders Szepessy. Adaptive random Fourier features with Metropolis sampling. Foundations of Data Science, 2020, 2 (3) : 309-332. doi: 10.3934/fods.2020014
##### Figures and Tables:
- Case 1: Graph of the target function $f$ with the sampled data points $(x_n, y_n)$ marked (red online). The inset shows the modulus $|\hat f|$ of its Fourier transform and a detail of its behaviour at the origin.
- Case 1: The same data are shown in all panels, with each of the six experiments highlighted in turn (blue online).
- Case 2: Generalization error as a function of $K$ for a target function in dimension $d = 5$.
- Case 3: Generalization error over time when the approximate solution is computed with Algorithm 1, Algorithm 2, and stochastic gradient descent.
- Case 4: Dependence of the MNIST misclassification percentage on $K$.
- Generalization error as a function of the standard deviation $\sigma_\omega$ of $p(\omega)$.
Summary of numerical experiments together with their corresponding parameter choices
**Regression (Cases 1 and 2)**

| | Case 1: a regularised step function | Case 2: a high-dimensional function |
| --- | --- | --- |
| Purpose | Find $p$ such that the constant $\mathbb{E}_\omega\big[\lvert\hat f(\omega)\rvert^2/\big((2\pi)^{d}p^2(\omega)\big)\big]$ does not become large | Study the ability of Alg. 1 to find a high-dimensional function |
| Target $f(x)$ | $\text{Si}\left(\frac{x}{a}\right)e^{-x^2/2}$, $a = 10^{-3}$ | $\text{Si}\left(\frac{x_1}{a}\right)e^{-\lvert x\rvert^2/2}$, $a = 10^{-1}$ |
| $d$ | $1$ | $5$ |
| $K$ | $2^i,\ i = 1, \dots, 11$ | $2^i,\ i = 1, \dots, 10$ |

| | Case 1, Exp. 1 | Case 1, Exp. 2 | Case 1, Exp. 3 | Case 1, Exp. 4 | Case 1, Exp. 5 | Case 2, Exp. 1 | Case 2, Exp. 4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Method | Alg. 1 | Alg. 2 | RFF$^{1}$ | SGM | Alg. 1 tabular | Alg. 1 | SGM |
| $N$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\tilde{N}$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\gamma$ | $3d-2$ | $3d-2$ | | | $3d-2$ | $3d-2$ | |
| $\lambda$ | $0.1$ | $0.1$ | $0.1$ | $0$ | $0.1$ | $0.1$ | $0$ |
| $M$ | $10^3$ | $5000$ | N/A | $10^7$ | $10^4$ | $2.5\times 10^3$ | $10^7$ |
| $\bar{M}$ | $10$ | $10$ | $10$ | $10$ | $10$ | $10$ | $10$ |
| $\delta$ | $2.4^2/d$ | $0.1$ | | | $2.4^2/d$ | $\frac{2.4^2}{10d}$ | |
| $\Delta t$ | | | | $1.5\times 10^{-4}$ | | | $3.0\times 10^{-4}$ |
| $t_0$ | | $M/10$ | | | | | |
| $\omega_{\text{max}}$ | $\infty$ | | | | | | |
| $m$ | $10$ | $50$ | | | $100$ | $25$ | |

**Regression (Case 3) and classification (Case 4)**

| | Case 3: anisotropic Gaussian function | Case 4: the MNIST data set |
| --- | --- | --- |
| Purpose | Computational complexity comparison | Study the ability of Alg. 1 to work with non-synthetic data |
| Target $f(x)$ | $e^{-(32 x_1)^2/2}e^{-(32^{-1} x_2)^2/2}$ | |
| $d$ | $2$ | $784$ |
| $K$ | $256$ | $2^i,\ i = 1, \dots, 13$ |

| | Case 3, Exp. 1 | Case 3, Exp. 2 | Case 3, Exp. 4 | Case 4, Exp. 1 | Case 4, Exp. 3 |
| --- | --- | --- | --- | --- | --- |
| Method | Alg. 1 | Alg. 2 | SGM | Alg. 1 | RFF$^{2}$ |
| $N$ | $10^4$ | $10^4$ | $3\times 10^7$ | $6\times 10^4$ | $6\times 10^4$ |
| $\tilde{N}$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\gamma$ | $3d-2$ | $3d-2$ | | $3d-2$ | |
| $\lambda$ | $0.1$ | $0.1$ | $0$ | $0.1$ | $0.1$ |
| $M$ | $10^4$ | $10^4$ | $3\times 10^7$ | $10^2$ | N/A |
| $\bar{M}$ | $1$ | $1$ | $1$ | $1$ | $1$ |
| $\delta$ | $0.5$ | $0.1$ | | $0.1$ | |
| $\Delta t$ | | | $1.5\times 10^{-3}$ | | |
| $t_0$ | | $M/10$ | | | |
| $\omega_{\text{max}}$ | $\infty$ | | | | |
| $m$ | $100$ | $100$ | | $M+1$ | |

$^{1}$ Random Fourier features with frequencies sampled from the fixed distribution $\mathcal{N}(0,1)$.
$^{2}$ Random Fourier features with frequencies sampled from the fixed distribution $\mathcal{N}(0,1)$ or $\mathcal{N}(0,0.1^2)$.
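As an aside on the synthetic targets above, the Case 1 regularised step function can be generated directly with SciPy's sine integral. The sampling distribution of the data points $x_n$ in this snippet is an illustrative assumption, not taken from the table; only the target formula $\text{Si}(x/a)e^{-x^2/2}$ and $N = 10^4$ come from it.

```python
import numpy as np
from scipy.special import sici

def target_case1(x, a=1e-3):
    """Case 1 target: Si(x/a) * exp(-x^2/2), a regularised step function."""
    si, _ = sici(x / a)                 # sici returns the pair (Si(x), Ci(x))
    return si * np.exp(-x ** 2 / 2)

rng = np.random.default_rng(0)
x_train = rng.standard_normal(10_000)   # N = 10^4 samples as in the table;
y_train = target_case1(x_train)         # standard normal x_n is an assumption
```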
Case 4: The table shows the percentage of misclassified digits in the MNIST test data set for different values of $K$, comparing the adaptively computed distribution of frequencies $\omega_k$ with sampling from a fixed normal distribution.
| $K$ | Fixed, $\omega\sim\mathcal{N}(0, 1)$ | Fixed, $\omega\sim\mathcal{N}(0, 0.1^2)$ | Adaptive |
| --- | --- | --- | --- |
| 2 | 89.97% | 81.65% | 80.03% |
| 4 | 89.2% | 70.65% | 65.25% |
| 8 | 88.98% | 55.35% | 53.44% |
| 16 | 88.69% | 46.77% | 36.42% |
| 32 | 88.9% | 30.43% | 23.52% |
| 64 | 88.59% | 19.73% | 16.98% |
| 128 | 88.7% | 13.6% | 11.13% |
| 256 | 88.09% | 10.12% | 7.99% |
| 512 | 88.01% | 8.16% | 5.93% |
| 1024 | 87.34% | 6.29% | 4.57% |
| 2048 | 86.5% | 4.94% | 3.5% |
| 4096 | 85.21% | 3.76% | 2.74% |
| 8192 | 83.98% | 3.16% | 1.98% |
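For the "Fixed" columns of the MNIST table, a natural baseline implementation is regularized least squares against one-hot class targets, classifying by the largest real part of the fitted expansion. The one-hot encoding and the normal-equations solver are our assumptions for illustration; only the feature map $e^{\mathrm{i}\omega_k\cdot x}$ and the sampled distributions $\mathcal{N}(0,1)$ and $\mathcal{N}(0,0.1^2)$ come from the paper.

```python
import numpy as np

def rff_features(x, omega):
    """Complex Fourier features exp(i x . omega_k); x is (N, 784), omega is (K, 784)."""
    return np.exp(1j * x @ omega.T)

def fit_one_vs_all(S, labels, lam, n_classes=10):
    """Regularized least squares against one-hot targets, one amplitude vector per class."""
    Y = np.eye(n_classes)[labels]                        # (N, 10) one-hot targets
    A = S.conj().T @ S / S.shape[0] + lam * np.eye(S.shape[1])
    B = S.conj().T @ Y / S.shape[0]
    return np.linalg.solve(A, B)                         # (K, 10) amplitudes

def classify(S, beta):
    """Predict the class whose fitted expansion has the largest real part."""
    return np.argmax((S @ beta).real, axis=1)

# Fixed-distribution baseline for the sigma = 0.1 column:
# omega = 0.1 * np.random.default_rng(0).standard_normal((K, 784))
# S = rff_features(x_train, omega); beta = fit_one_vs_all(S, labels, lam=0.1)
# error = np.mean(classify(rff_features(x_test, omega), beta) != test_labels)
```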
