
Hierarchical regularization networks for sparsification based learning on noisy datasets

*Corresponding author: Prashant Shekhar

The third author is supported by NSF grant 2004302

Abstract. We propose a hierarchical learning strategy aimed at generating sparse representations and associated models for noisy datasets. The hierarchy follows from approximation spaces identified at successively finer scales. To promote model generalization at each scale, we also introduce a novel projection-based penalty operator across multiple dimensions, using permutation operators to incorporate proximity and ordering information. The paper presents a detailed analysis of stability and approximation properties in the reconstruction Reproducing Kernel Hilbert Spaces (RKHS), with emphasis on the optimality and consistency of predictions and the behavior of the error functionals associated with the produced sparse representations. Results demonstrate the data reduction and modeling capabilities of our approach on synthetic (univariate and multivariate) and real multivariate datasets. To assess the quality of our results, we compare performance with multiple variants of Gaussian Processes on measures of optimal smoothing and generalization. Being a sparsity-driven approach, we also compare against widely used sparse methods such as the Relevance Vector Machine (RVM) and greedy approaches like Orthogonal Matching Pursuit (OMP). The superior performance of our approach on the majority of experiments, compared to optimized implementations of these widely used algorithms, clearly justifies its usefulness.

    Mathematics Subject Classification: Primary: 68W25, 65D15; Secondary: 33F05.

Figure 1.  Nature of the penalty for 2-D basis functions, imposed by projection onto the corresponding dimensions and application of a permutation operator
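As a rough illustration of how proximity and ordering information can enter such a penalty, the sketch below permutes basis-function coefficients by the sorted coordinates of their centers along each dimension and penalizes first differences of the permuted vector. This is only a toy analogue of the idea pictured in Figure 1, not the paper's projection/permutation operator; the function name, the random centers, and the choice of a first-difference penalty are all assumptions.

```python
import numpy as np

def ordering_penalty(coeffs, centers, q=1):
    """Toy proximity/ordering penalty (illustrative only): for each dimension,
    reorder the coefficients by the sorted centers along that dimension and
    penalize first differences of the reordered vector."""
    total = 0.0
    for dim in range(centers.shape[1]):
        perm = np.argsort(centers[:, dim])   # permutation induced by this dimension
        c_perm = coeffs[perm]                # coefficients in spatial order
        total += np.sum(np.abs(np.diff(c_perm)) ** q)
    return total

# Example with 20 random 2-D basis centers and random coefficients
rng = np.random.default_rng(0)
print(ordering_penalty(rng.normal(size=20), rng.uniform(0, 1, size=(20, 2)), q=1))
```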

Figure 2.  (a) shows the 1-d Schwefel function (red curve) along with the sampled noisy data (blue scattered points). To produce this dataset, $ x $ and $ f $ in (55) were first normalized between 0 and 1, and random Gaussian noise with $ \sigma = 0.05 $ was then added to produce the noisy samples. (b) shows only this noisy data, to give a visual intuition of the modeling complexity
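For readers who want to reproduce the data-generation step described in this caption, a minimal sketch is given below, assuming the standard Schwefel benchmark definition on $[-500, 500]$; the sample size and random seed are arbitrary choices, not values taken from the paper.

```python
import numpy as np

def schwefel(x):
    """Standard d-dimensional Schwefel benchmark:
    f(x) = 418.9829*d - sum_i x_i * sin(sqrt(|x_i|))."""
    x = np.atleast_2d(x)
    return 418.9829 * x.shape[1] - np.sum(x * np.sin(np.sqrt(np.abs(x))), axis=1)

rng = np.random.default_rng(0)
n = 200                                          # assumed sample size
x = rng.uniform(-500.0, 500.0, size=(n, 1))      # Schwefel's usual domain
f = schwefel(x)

# Normalize x and f to [0, 1], then add Gaussian noise with sigma = 0.05.
x01 = (x - x.min()) / (x.max() - x.min())
f01 = (f - f.min()) / (f.max() - f.min())
y = f01 + rng.normal(scale=0.05, size=f01.shape)
```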

Figure 3.  Scale-wise performance and solution of the proposed approach on the univariate test function with penalty $ q $ = 1. With smaller sparse representations, the approximation is oversmoothed at the initial scales, with noticeable improvement as the scale increases. Scale 7 produces the best approximation here. The legend is shown at the bottom of the figure

Table 1.  Performance of the proposed approach on a univariate (1d Schwefel) test function. Here we show the compression ratio $ comp_s $ (56), the optimal cost at each scale (18), and the optimal smoothing parameter $ \hat{\lambda}_s $ for all scales (0 to 11, as shown in Figure 3). The scale with the minimum fitting cost ($ Cost_s $) is highlighted ($ t $ = 7)

| $ Scale $ | $ comp_s $ | $ Cost_s $ | $ \hat{\lambda}_{s} $ |
|---|---|---|---|
| 0 | 0.94 | 4.74e-02 | 7.20e-06 |
| 1 | 0.93 | 3.00e-02 | 1.37e-12 |
| 2 | 0.92 | 4.76e-02 | 4.59e-03 |
| 3 | 0.90 | 4.10e-02 | 8.78e-06 |
| 4 | 0.87 | 6.29e-03 | 2.09e-16 |
| 5 | 0.82 | 3.04e-03 | 3.24e-16 |
| 6 | 0.77 | 3.08e-03 | 9.26e-07 |
| 7 | 0.68 | 2.70e-03 | 6.42e-06 |
| 8 | 0.57 | 2.73e-03 | 1.31e-05 |
| 9 | 0.40 | 2.86e-03 | 2.54e-03 |
| 10 | 0.18 | 3.12e-03 | 1.10e-02 |
| 11 | 0.00 | 3.38e-03 | 3.23e-02 |
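The highlighted scale in Table 1 is simply the one with the smallest fitting cost $ Cost_s $; recomputing the argmin over the tabulated column reproduces the choice $ t $ = 7.

```python
import numpy as np

# Cost_s column of Table 1, scales 0 through 11
cost = np.array([4.74e-02, 3.00e-02, 4.76e-02, 4.10e-02, 6.29e-03, 3.04e-03,
                 3.08e-03, 2.70e-03, 2.73e-03, 2.86e-03, 3.12e-03, 3.38e-03])
print(int(np.argmin(cost)))  # -> 7, the highlighted scale
```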

Table 2.  Comparative study of the smoothing performance of our proposed approach on simulated datasets against standard Gaussian Process approaches. Here we considered samples from the Schwefel function in 3 dimensions, normalized between 0 and 1 as explained for the univariate case. Then, for each of the 4 noise levels considered ($ \sigma_d $ = 0.01, 0.05, 0.1 and 0.2), 50 noisy training sets were generated and passed to the algorithms presented in the table. The values shown in scientific notation in the MSE rows represent the average mean squared error over the 50 cases with respect to the true underlying function. For each noise level (4 rightmost columns), the best performing algorithm has been highlighted. Besides the average MSE, each configuration also reports the average optimized values of the length-scale ($ \hat{\epsilon} $) and noise ($ \hat{\sigma} $) hyperparameters for Gaussian process regression. The 'Param. Init.' column gives the initializations of these hyperparameters passed to the Gaussian Process kernels

| Method | Param. Init. | | $ \sigma_d = 0.01 $ | $ \sigma_d = 0.05 $ | $ \sigma_d = 0.1 $ | $ \sigma_d = 0.2 $ |
|---|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | MSE | 2.22e-04 | 1.19e-03 | 3.48e-03 | 1.01e-02 |
| | $ q = 2 $ | MSE | 2.47e-04 | 1.51e-03 | 3.55e-03 | 1.02e-02 |
| Gaussian Process (RBF) | $ \epsilon = 0.2 $, $ \sigma = 0.01 $ | MSE | 1.67e-02 | 7.90e-03 | 1.87e-02 | 3.94e-02 |
| | | $ \hat{\epsilon} $ | 14.8 | 1.7 | 0.01 | 0.01 |
| | | $ \hat{\sigma} $ | 0.1 | 0.2 | 0.2 | 0.3 |
| | $ \epsilon = 0.05 $, $ \sigma = 0.1 $ | MSE | 6.76e-05 | 1.17e-03 | 5.15e-03 | 1.97e-02 |
| | | $ \hat{\epsilon} $ | 0.10 | 0.09 | 0.12 | 17.5 |
| | | $ \hat{\sigma} $ | 0.009 | 0.05 | 0.1 | 0.2 |
| | $ \epsilon = 1 $, $ \sigma = 0.2 $ | MSE | 2.07e-02 | 2.07e-02 | 2.07e-02 | 2.07e-02 |
| | | $ \hat{\epsilon} $ | 18.93 | 18.93 | 18.88 | 21.68 |
| | | $ \hat{\sigma} $ | 0.1 | 0.2 | 0.2 | 0.2 |
| Gaussian Process (Matern) | $ \epsilon = 0.2 $, $ \sigma = 0.01 $ | MSE | 9.99e-05 | 2.46e-03 | 4.44e-03 | 1.36e-02 |
| | | $ \hat{\epsilon} $ | 4.27 | 0.25 | 0.20 | 5.83 |
| | | $ \hat{\sigma} $ | 0.0002 | 0.006 | 0.08 | 0.2 |
| | $ \epsilon = 0.05 $, $ \sigma = 0.1 $ | MSE | 9.99e-05 | 2.46e-03 | 4.44e-03 | 1.23e-02 |
| | | $ \hat{\epsilon} $ | 4.23 | 0.25 | 0.20 | 1.82 |
| | | $ \hat{\sigma} $ | 0.00006 | 0.006 | 0.08 | 0.2 |
| | $ \epsilon = 1 $, $ \sigma = 0.2 $ | MSE | 9.91e-05 | 2.47e-03 | 4.44e-03 | 1.27e-02 |
| | | $ \hat{\epsilon} $ | 4.22 | 0.27 | 0.2 | 4.13 |
| | | $ \hat{\sigma} $ | 0.003 | 0.02 | 0.08 | 0.2 |
| Gaussian Process (RQ) | $ \epsilon = 0.2 $, $ \sigma = 0.01 $ | MSE | 8.02e-05 | 2.17e-03 | 9.05e-03 | 3.15e-02 |
| | | $ \hat{\epsilon} $ | 0.19 | 0.32 | 0.17 | 0.03 |
| | | $ \hat{\sigma} $ | 0.007 | 0.02 | 0.04 | 0.1 |
| | $ \epsilon = 0.05 $, $ \sigma = 0.1 $ | MSE | 8.02e-05 | 1.63e-03 | 9.45e-03 | 4.00e-02 |
| | | $ \hat{\epsilon} $ | 0.19 | 0.32 | 0.23 | 0.04 |
| | | $ \hat{\sigma} $ | 0.007 | 0.03 | 0.02 | 0.004 |
| | $ \epsilon = 1 $, $ \sigma = 0.2 $ | MSE | 2.07e-02 | 2.07e-02 | 2.07e-02 | 2.07e-02 |
| | | $ \hat{\epsilon} $ | 37.52 | 37.59 | 37.65 | 43.25 |
| | | $ \hat{\sigma} $ | 0.1 | 0.2 | 0.2 | 0.2 |
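The Gaussian-process baselines above can be set up with scikit-learn's `GaussianProcessRegressor`; the sketch below shows one plausible way to map the 'Param. Init.' column onto kernel initializations. The training data, optimizer settings, and whether the table's $ \sigma $ is a standard deviation or a variance are not specified here, so treat this as an assumed configuration rather than the paper's exact experimental code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic, WhiteKernel

def make_gp(kernel_name, length_scale, noise_level):
    """Build a GP whose kernel family and hyperparameter initialization mirror
    one row of Table 2; fitting then refines both hyperparameters."""
    base = {"RBF": RBF(length_scale=length_scale),
            "Matern": Matern(length_scale=length_scale),
            "RQ": RationalQuadratic(length_scale=length_scale)}[kernel_name]
    # WhiteKernel's noise_level is a variance; the table's sigma is treated
    # here as that initialization (an assumption).
    return GaussianProcessRegressor(kernel=base + WhiteKernel(noise_level=noise_level),
                                    normalize_y=True)

# Example cell of Table 2 (RBF, eps = 0.2, sigma = 0.01), with X_train, y_train,
# X_test, f_true assumed to come from a noisy Schwefel setup as sketched earlier:
# gp = make_gp("RBF", 0.2, 0.01).fit(X_train, y_train)
# mse = np.mean((gp.predict(X_test) - f_true) ** 2)
```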

Table 3.  Sparsity-based performance comparison of our proposed approach on simulated datasets against Orthogonal Matching Pursuit and Relevance Vector Machines. Here we considered 200 samples from the Schwefel function in 1 dimension, normalized between 0 and 1 as explained for the univariate case. Then, for each of the 4 noise levels considered (shown in the 4 rightmost columns), 50 noisy training sets were generated and passed to the algorithms presented in the table. The values shown in scientific notation represent the average mean squared error over the 50 cases with respect to the true underlying function. For each noise level (4 rightmost columns), the best performing algorithm has been highlighted. The $ n^{'} $ parameter for OMP caps the number of functions that can be included in the model, $ \gamma $ is the kernel coefficient for RVM, and the numbers in brackets represent the average size of the learnt sparse representations

| Method | Param. Init. | $ \sigma_d = 0.01 $ | $ \sigma_d = 0.05 $ | $ \sigma_d = 0.1 $ | $ \sigma_d = 0.2 $ |
|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | 2.93e-05 (90.36) | 3.37e-04 (63.14) | 1.31e-03 (73.94) | 4.06e-03 (66.58) |
| Orthogonal Matching Pursuit (OMP) | $ n^{'} $ = 5 | 8.02e-03 | 1.20e-02 | 1.16e-02 | 2.03e-02 |
| | $ n^{'} $ = 10 | 2.39e-03 | 2.39e-03 | 3.57e-03 | 8.02e-03 |
| | $ n^{'} $ = 20 | 2.79e-04 | 4.49e-04 | 1.49e-03 | 5.07e-03 |
| | $ n^{'} $ = 40 | 2.88e-05 | 5.02e-04 | 2.06e-03 | 7.44e-03 |
| | $ n^{'} $ = 60 | 3.02e-05 | 6.01e-04 | 2.61e-03 | 8.72e-03 |
| | $ n^{'} $ = 80 | 3.42e-05 | 6.13e-04 | 2.80e-03 | 9.46e-03 |
| | $ n^{'} $ = 100 | 3.63e-05 | 6.20e-04 | 2.87e-03 | 9.79e-03 |
| | $ n^{'} $ = 120 | 3.63e-05 | 6.20e-04 | 2.89e-03 | 9.96e-03 |
| | $ n^{'} $ = 140 | 3.63e-05 | 6.20e-04 | 2.91e-03 | 1.01e-02 |
| | $ n^{'} $ = 160 | 3.63e-05 | 6.20e-04 | 2.92e-03 | 1.02e-02 |
| | $ n^{'} $ = 180 | 3.63e-05 | 6.20e-04 | 2.95e-03 | 1.02e-02 |
| | $ n^{'} $ = 200 | 3.63e-05 | 6.20e-04 | 2.95e-03 | 1.02e-02 |
| Relevance Vector Machine (RVM) | $ \gamma = 0.1 $ | 4.58e-02 (39.24) | 4.67e-02 (27.16) | 4.76e-02 (13.78) | 4.89e-02 (1.94) |
| | $ \gamma = 1 $ | 4.44e-02 (8.26) | 4.45e-02 (8.1) | 4.59e-02 (6.86) | 4.83e-02 (3.94) |
| | $ \gamma = 10 $ | 1.18e-02 (26.04) | 1.20e-02 (24.98) | 1.25e-02 (23.22) | 1.43e-02 (19.18) |
| | $ \gamma = 100 $ | 4.50e-05 (21.2) | 5.09e-04 (14.88) | 1.80e-03 (10.96) | 4.44e-03 (9.86) |
| | $ \gamma = 1000 $ | 2.90e-05 (31.32) | 5.14e-04 (22.44) | 1.84e-03 (18.16) | 5.90e-03 (14.7) |
| | $ \gamma = 10000 $ | 7.70e-05 (111.66) | 1.59e-03 (95.42) | 5.80e-03 (88.76) | 1.55e-02 (53.48) |
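One common way to run OMP in this kernel-regression setting is to build an RBF dictionary centered at the training points and let scikit-learn's `OrthogonalMatchingPursuit` keep at most $ n^{'} $ columns. The sketch below illustrates that setup; the kernel width `gamma` and the dictionary construction are assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.metrics.pairwise import rbf_kernel

def omp_fit_predict(X_train, y_train, X_test, n_prime, gamma=100.0):
    """Sparse regression over an RBF dictionary centered at the training points.
    n_prime caps the number of retained basis functions (the n' of Table 3);
    gamma is an assumed kernel width."""
    Phi_train = rbf_kernel(X_train, X_train, gamma=gamma)  # dictionary on the train set
    Phi_test = rbf_kernel(X_test, X_train, gamma=gamma)    # same centers, test inputs
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_prime).fit(Phi_train, y_train)
    return omp.predict(Phi_test)

# Example (data assumed from the noisy 1-D Schwefel setup sketched earlier):
# y_hat = omp_fit_predict(x_train, y_train, x_test, n_prime=40)
# mse = np.mean((y_hat - f_true_test) ** 2)
```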

Table 4.  Results analogous to Table 3 in higher dimension ($ d $ = 3) with a sample size of 3375 ($ 15 \times 15 \times 15 $). RVM is not considered here because its high computational complexity makes its application infeasible for larger datasets. As in Table 3, the best performing algorithm for each of the 4 noise levels has been highlighted

| Method | Param. Init. | $ \sigma_d = 0.01 $ | $ \sigma_d = 0.05 $ | $ \sigma_d = 0.1 $ | $ \sigma_d = 0.2 $ |
|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | 8.49e-05 (3374) | 1.14e-03 (3374) | 3.47e-03 (3375) | 1.01e-02 (3375) |
| Orthogonal Matching Pursuit (OMP) | $ n^{'} $ = 100 | 1.74e-02 | 1.76e-02 | 1.62e-02 | 2.01e-02 |
| | $ n^{'} $ = 200 | 1.38e-02 | 1.42e-02 | 1.23e-02 | 1.78e-02 |
| | $ n^{'} $ = 500 | 5.75e-03 | 6.39e-03 | 6.25e-03 | 1.49e-02 |
| | $ n^{'} $ = 1000 | 1.47e-03 | 2.38e-03 | 4.88e-03 | 1.85e-02 |
| | $ n^{'} $ = 1500 | 9.27e-05 | 1.34e-03 | 6.14e-03 | 2.46e-02 |
| | $ n^{'} $ = 2000 | 6.82e-05 | 1.68e-03 | 7.45e-03 | 2.97e-02 |
| | $ n^{'} $ = 2500 | 8.01e-05 | 2.00e-03 | 8.58e-03 | 3.41e-02 |
| | $ n^{'} $ = 3000 | 9.23e-05 | 2.31e-03 | 9.55e-03 | 3.81e-02 |
| | $ n^{'} $ = 3375 | 9.18e-04 | 4.96e-03 | 1.00e-02 | 4.01e-02 |

Table 5.  Generalization/testing error analysis: generalization study of our proposed approach on real datasets, compared with multiple variants of GPs identified by different hyperparameter initializations. Here we consider the Diabetes dataset, which is 10-dimensional, and the Boston Housing dataset, which is 13-dimensional. For each dataset, 3 train/test splits were considered, with 100 repetitions each for higher confidence in the reported results. The values shown in each of the last 6 columns represent the average test mean squared error over the 100 cases. For each dataset and splitting criterion, the best performing algorithm has been highlighted

| Method | Param. Init. | Diabetes (d = 10): 40/60 | 60/40 | 80/20 | Boston Housing (d = 13): 40/60 | 60/40 | 80/20 |
|---|---|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | 6.44e+03 | 6.37e+03 | 6.29e+03 | 2.72e+01 | 2.61e+01 | 2.50e+01 |
| | $ q = 2 $ | 6.40e+03 | 6.37e+03 | 6.29e+03 | 2.96e+01 | 2.83e+01 | 2.71e+01 |
| Gaussian Process (RBF) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 2.39e+04 | 2.25e+04 | 2.06e+04 | 1.33e+01 | 1.17e+01 | 9.46e+00 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 1.49e+04 | 1.22e+04 | 1.69e+04 | 1.33e+01 | 1.17e+01 | 9.46e+00 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 2.68e+04 | 2.85e+04 | 2.81e+04 | 9.02e+01 | 1.58e+02 | 4.56e+02 |
| Gaussian Process (Matern) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 2.64e+04 | 2.57e+04 | 2.83e+04 | 1.87e+01 | 1.74e+01 | 1.67e+02 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 1.97e+04 | 2.90e+04 | 2.83e+04 | 1.24e+01 | 1.10e+01 | 2.57e+02 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 2.85e+04 | 2.90e+04 | 2.83e+04 | 8.53e+01 | 7.99e+01 | 5.68e+01 |
| Gaussian Process (RQ) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 5.43e+03 | 4.04e+03 | 3.73e+03 | 8.64e+01 | 7.56e+01 | 4.93e+01 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 6.05e+03 | 5.78e+03 | 5.55e+03 | 8.64e+01 | 8.47e+01 | 8.61e+01 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 2.91e+04 | 2.89e+04 | 2.80e+04 | 3.76e+02 | 4.24e+02 | 4.52e+02 |
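A minimal version of the train/test protocol behind Tables 5 and 6 is sketched below for the Diabetes data, which ships with scikit-learn (Boston Housing has been removed from recent scikit-learn releases and must be obtained separately). The 60/40 split, random seed, and kernel initialization are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic, WhiteKernel

X, y = load_diabetes(return_X_y=True)  # 10-dimensional regression dataset

# One splitting criterion from Table 5 (60/40), repeated here only once.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

# One GP variant from Table 5: RQ kernel with eps = 0.2 and sigma^2 = 0.01.
kernel = RationalQuadratic(length_scale=0.2) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_tr, y_tr)

train_mse = np.mean((gp.predict(X_tr) - y_tr) ** 2)  # Table 6-style quantity
test_mse = np.mean((gp.predict(X_te) - y_te) ** 2)   # Table 5-style quantity
print(f"train MSE = {train_mse:.3e}, test MSE = {test_mse:.3e}")
```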

Table 6.  Training error analysis: average training error (over 100 training sets) for all the experiments presented in Table 5. For each dataset and splitting criterion, the best performing algorithm (minimum average training MSE) has again been highlighted

| Method | Param. Init. | Diabetes (d = 10): 40/60 | 60/40 | 80/20 | Boston Housing (d = 13): 40/60 | 60/40 | 80/20 |
|---|---|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | 6.36e+03 | 6.40e+03 | 6.39e+03 | 2.17e+01 | 2.15e+01 | 2.19e+01 |
| | $ q = 2 $ | 6.32e+03 | 6.40e+03 | 6.39e+03 | 2.49e+01 | 2.42e+01 | 2.43e+01 |
| Gaussian Process (RBF) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 1.52e-02 | 2.26e-02 | 2.69e-02 | 5.25e+00 | 4.63e+00 | 4.52e+00 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 4.26e-02 | 5.31e-02 | 3.97e-02 | 5.25e+00 | 4.63e+00 | 4.52e+00 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 1.07e+04 | 1.20e+04 | 1.25e+04 | 8.51e+01 | 1.16e+02 | 3.44e+02 |
| Gaussian Process (Matern) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 6.98e-03 | 9.45e-03 | 1.78e-03 | 3.43e+00 | 2.31e+00 | 1.87e+00 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 1.94e-02 | 2.32e-03 | 2.27e-03 | 2.79e+00 | 2.69e+00 | 1.79e+00 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 9.61e+03 | 1.12e+04 | 1.20e+04 | 8.37e+01 | 8.04e+01 | 4.22e+01 |
| Gaussian Process (RQ) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 5.29e+03 | 3.95e+03 | 3.70e+03 | 8.48e+01 | 7.57e+01 | 4.65e+01 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 5.97e+03 | 5.83e+03 | 5.59e+03 | 8.48e+01 | 8.53e+01 | 8.44e+01 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 2.90e+04 | 2.91e+04 | 2.89e+04 | 3.75e+02 | 4.25e+02 | 4.50e+02 |

Table 7.  Training time analysis (in seconds): average training time (over 100 training sets) for all the experiments presented in Table 5. For each experiment, the best performing algorithm (minimum average training time) has again been highlighted

| Method | Param. Init. | Diabetes (d = 10): 40/60 | 60/40 | 80/20 | Boston Housing (d = 13): 40/60 | 60/40 | 80/20 |
|---|---|---|---|---|---|---|---|
| Hierarchical Algorithm | $ q = 1 $ | 2.39e+00 | 5.44e+00 | 1.09e+01 | 2.20e+00 | 4.90e+00 | 9.04e+00 |
| | $ q = 2 $ | 2.55e+00 | 6.12e+00 | 1.19e+01 | 2.23e+00 | 4.88e+00 | 9.36e+00 |
| Gaussian Process (RBF) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 5.08e+00 | 9.76e+00 | 1.93e+01 | 7.04e+00 | 1.64e+01 | 3.36e+01 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 6.07e+00 | 1.32e+01 | 2.22e+01 | 8.79e+00 | 1.93e+01 | 3.62e+01 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 3.73e+00 | 6.30e+00 | 1.15e+01 | 6.90e+00 | 1.38e+01 | 1.71e+01 |
| Gaussian Process (Matern) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 7.15e+00 | 1.21e+01 | 2.40e+01 | 1.17e+01 | 2.42e+01 | 3.45e+01 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 7.15e+00 | 1.12e+01 | 2.26e+01 | 1.04e+01 | 2.02e+01 | 3.40e+01 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 3.72e+00 | 6.86e+00 | 1.28e+01 | 8.10e+00 | 1.87e+01 | 4.88e+01 |
| Gaussian Process (RQ) | $ \epsilon = 0.2 $ & $ \sigma^2 = 0.01 $ | 7.62e+00 | 3.54e+01 | 6.77e+01 | 1.36e+00 | 1.13e+01 | 7.38e+01 |
| | $ \epsilon = 0.05 $ & $ \sigma^2 = 0.2 $ | 1.28e+00 | 1.43e+01 | 2.82e+01 | 1.35e+00 | 2.52e+00 | 3.93e+00 |
| | $ \epsilon = 1 $ & $ \sigma^2 = 1 $ (default) | 3.18e+00 | 7.17e+00 | 1.37e+01 | 6.57e+00 | 1.48e+01 | 2.50e+01 |
