\`x^2+y_1+z_12^34\`
Advanced Search
Article Contents
Article Contents

Adaptive random Fourier features with Metropolis sampling

  • * Corresponding author: Aku Kammonen

    * Corresponding author: Aku Kammonen 
The research was supported by Swedish Research Council grant 2019-03725. The work of P. P. was supported in part by the ARO Grant W911NF-19-1-0243
Abstract Full Text(HTML) Figure(6) / Table(2) Related Papers Cited by
  • The supervised learning problem to determine a neural network approximation $ \mathbb{R}^d\ni x\mapsto\sum_{k = 1}^K\hat\beta_k e^{{{\mathrm{i}}}\omega_k\cdot x} $ with one hidden layer is studied as a random Fourier features algorithm. The Fourier features, i.e., the frequencies $ \omega_k\in\mathbb{R}^d $, are sampled using an adaptive Metropolis sampler. The Metropolis test accepts proposal frequencies $ \omega_k' $, having corresponding amplitudes $ \hat\beta_k' $, with the probability $ \min\big\{1, (|\hat\beta_k'|/|\hat\beta_k|)^{\gamma}\big\} $, for a certain positive parameter $ {\gamma} $, determined by minimizing the approximation error for given computational work. This adaptive, non-parametric stochastic method leads asymptotically, as $ K\to\infty $, to equidistributed amplitudes $ |\hat\beta_k| $, analogous to deterministic adaptive algorithms for differential equations. The equidistributed amplitudes are shown to asymptotically correspond to the optimal density for independent samples in random Fourier features methods. Numerical evidence is provided in order to demonstrate the approximation properties and efficiency of the proposed algorithm. The algorithm is tested both on synthetic data and a real-world high-dimensional benchmark.

    Mathematics Subject Classification: Primary: 65D15; Secondary: 65D40, 65C05.

    Citation:

    \begin{equation} \\ \end{equation}
  • 加载中
  • Figure 1.  Case 1: Graph of the target function $ f $ with sampled data set points $ (x_n,y_n) $ marked (red on-line). The inset shows $ |\hat f| $ of its Fourier transform and the detail of its behaviour at the origin

    Figure 2.  Case 1: The same data are shown in all figures with each of the six different experiments highlighted (blue on-line)

    Figure 3.  Case 2: The figure illustrates the generalization error with respect to $ K $ for a target function in dimension $ d = 5 $

    Figure 4.  Case 3: The figure illustrates the generalization error over time when the approximate problem solution is computed using Algorithm 1, Algorithm 2 and stochastic gradient descent

    Figure 5.  Case 4: Dependence on $ K $ of the misclassification percentage in the MNIST

    Figure 6.  The generalization error as a function of the standard deviation $ \sigma_\omega $ of $ p({\omega}) $

    Table 2.  Summary of numerical experiments together with their corresponding parameter choices

    Regression
    Case Case 1: a regularised step function Case 2: a high dimensional function
    Purpose Find $p$ such that the constant
    $\mathbb{E}_\omega[\frac{|\hat f(\omega)|^2}{(2\pi)^{d}p^2(\omega)}]$
    does not become large
    Study the ability of Alg. 1 to find a high dimensional dimensional function
    Target $f(x)$ $\text{Si}\left(\frac{x}{a}\right)e^{-\frac{x^2}{2}}$
    $a = 10^{-3}$
    $\text{Si}\left(\frac{x_1}{a}\right)e^{-\frac{|x|^2}{2}}$
    $a = 10^{-1}$
    $d$ $1$ $5$
    $K$ $2^i, \, i = 1, 2, ..., 11$ $2^i, \, i = 1, 2, ..., 10$
    Experiment Exp. 1 Exp. 2 Exp. 3 Exp. 4 Exp. 5 Exp. 1 Exp. 4
    Method Alg. 1 Alg. 2 RFF$^{1}$ SGM Alg. 1
    tabular
    Alg. 1 SGM
    $N$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$
    $\tilde{N}$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$
    $γ$ $3d-2$ $3d-2$ $3d-2$ $3d-2$
    $\lambda$ $0.1$ $0.1$ $0.1$ $0$ $0.1$ $0.1$ 0
    $M$ $10^3$ $5000$ N/A $10^7$ $10^4$ $2.5\times 10^3$ $10^7$
    $\bar{M}$ $10$ $10$ $10$ $10$ $10$ $10$ $10$
    $\delta$ $2.4^2/d$ $0.1$ $2.4^2/d$ $\frac{2.4^2}{10d}$
    $\Delta t$ $\mathtt{1.5e-4}$ $\mathtt{3.0e-4}$
    $t_0$ $M/10$
    $ω_{\text{max}}$ $\infty$
    $m$ $10$ $50$ $100$ $25$
    Regression Classification
    Case Case 3:
    anisotropic Gaussian
    function
    Case 4:
    The MNIST
    data set
    Purpose Computational
    complexity comparison
    Study the ability of Alg. 1 to work with non synthetic data
    Target
    $f(x)$
    $e^{-(32 x_1)^2/2}e^{-(32^{-1} x_2)^2/2}$
    $d$ $2$ $784$
    $K$ $256$ $2^i, \, i = 1, 2, ..., 13$
    Experiment Exp. 1 Exp. 2 Exp. 4 Exp. 1 Exp. 3
    Method Alg. 1 Alg. 2 SGM Alg. 1 RFF$^{2}$
    $N$ $10^4$ $10^4$ $3\times 10^7$ $6\times 10^4$ $6 \times 10^4$
    $\tilde{N}$ $10^4$ $10^4$ $10^4$ $10^4$ $10^4$
    $γ$ $3d-2$ $3d-2$ $3d-2$
    $\lambda$ $0.1$ $0.1$ $0$ $0.1$ $0.1$
    $M$ $10^4$ $10^4$ $3\times 10^7$ $10^2$
    $\bar{M}$ $1$ $1$ $1$ $1$ $1$
    $\delta$ $0.5$ $0.1$ $0.1$
    $\Delta t$ $\mathtt{1.5e-3}$
    $t_0$ $M/10$
    $ω_{\text{max}}$ $\infty$
    $m$ $100$ $100$ $M+1$
    $ ^1 $ – Random Fourier features with frequencies sampled from the fixed distribution $ \mathcal{N}(0,1) $
    $ ^2 $ – Random Fourier features with frequencies sampled from the fixed distribution $ \mathcal{N}(0,1) $, or $ \mathcal{N}(0,0.1^2) $
     | Show Table
    DownLoad: CSV

    Table 1.  Case 4: The table shows the percentage of misclassified digits in the MNIST test data set for different values of $ K $. Comparison between adaptively computed distribution of frequencies $ \omega_k $ and sampling a fixed (normal) distribution

    K Fixed Fixed Adaptive
    $ \omega\sim\mathcal{N}(0, 1) $ $ \omega\sim\mathcal{N}(0, 0.1^2) $
    2 89.97% 81.65% 80.03%
    4 89.2% 70.65% 65.25%
    8 88.98% 55.35% 53.44%
    16 88.69% 46.77% 36.42%
    32 88.9% 30.43% 23.52%
    64 88.59% 19.73% 16.98%
    128 88.7% 13.6% 11.13%
    256 88.09% 10.12% 7.99%
    512 88.01% 8.16% 5.93%
    1024 87.34% 6.29% 4.57%
    2048 86.5% 4.94% 3.5%
    4096 85.21% 3.76% 2.74%
    8192 83.98% 3.16% 1.98%
     | Show Table
    DownLoad: CSV
  • [1] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker and A. Zandieh, Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees, in Proceedings of the 34th International Conference on Machine Learning (eds. D. Precup and Y. W. Teh), vol. 70 of Proceedings of Machine Learning Research, PMLR, International Convention Centre, Sydney, Australia, 2017,253–262, http://proceedings.mlr.press/v70/avron17a.html.
    [2] I. Babuška and W. C. Rheinboldt, Error estimates for adaptive finite element computations, SIAM J. Numer. Anal., 15 (1978), 736-754.  doi: 10.1137/0715049.
    [3] F. Bach, On the equivalence between kernel quadrature rules and random feature expansions, J. Mach. Learn. Res., 18 (2017), Paper No. 21, 38.
    [4] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory, 39 (1993), 930-945.  doi: 10.1109/18.256500.
    [5] W. EC. Ma and L. Wu, A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics, Sci. China Math., 63 (2020), 1235-1258.  doi: 10.1007/s11425-019-1628-5.
    [6] L. C. Evans, Partial Differential Equations, vol. 19 of Graduate Studies in Mathematics, 2nd edition, American Mathematical Society, Providence, RI, 2010. doi: 10.1090/gsm/019.
    [7] H. Haario, E. Saksman and J. Tamminen, An adaptive metropolis algorithm, Bernoulli, 7 (2001), 223–242, https://projecteuclid.org:443/euclid.bj/1080222083. doi: 10.2307/3318737.
    [8] L. K. Jones, A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training, Ann. Statist., 20 (1992), 608-613.  doi: 10.1214/aos/1176348546.
    [9] Z. Li, J.-F. Ton, D. Oglic and D. Sejdinovic, Towards a unified analysis of random Fourier features, in ICML, 2019, arXiv: 1806.09178v4.
    [10] A. Rahimi and B. Recht, Random features for large-scale kernel machines, in Advances in Neural Information Processing Systems 20 (eds. J. C. Platt, D. Koller, Y. Singer and S. T. Roweis), Curran Associates, Inc., 2008, 1177–1184.
    [11] G. O. RobertsA. Gelman and W. R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, Ann. Appl. Probab., 7 (1997), 110-120.  doi: 10.1214/aoap/1034625254.
    [12] G. O. Roberts and J. S. Rosenthal, Optimal scaling for various Metropolis-Hastings algorithms, Statist. Sci., 16 (2001), 351-367.  doi: 10.1214/ss/1015346320.
    [13] G. O. Roberts and J. S. Rosenthal, Examples of adaptive MCMC, J. Comput. Graph. Statist., 18 (2009), 349–367, http://www.jstor.org/stable/25651249. doi: 10.1198/jcgs.2009.06134.
    [14] A. Rudi and L. Rosasco, Generalization properties of learning with random features, in Advances in Neural Information Processing Systems 30 (eds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett), Curran Associates, Inc., 2017, 3215–3225.
    [15] S. Shalev-Shwartz and  S. Ben-DavidUnderstanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.  doi: 10.1017/CBO9781107298019.
    [16] L. N. Trefethen and D. Bau III, Numerical Linear Algebra, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997. doi: 10.1137/1.9780898719574.
    [17] A. G. Wilson and R. P. Adams, Gaussian process kernels for pattern discovery and extrapolation, in ICML, 2013, arXiv: 1302.4245v3.
  • 加载中

Figures(6)

Tables(2)

SHARE

Article Metrics

HTML views(1976) PDF downloads(321) Cited by(0)

Access History

Catalog

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return