We study learning rates for kernel regularized regression (KRR) based on reproducing kernel Hilbert spaces (RKHSs) and differentiable strongly convex losses, and we provide some new strongly convex losses. We first establish the robustness of KRR with respect to the maximum mean discrepancy (MMD) and the Hutchinson metric, respectively, and, along this line, bound the learning rate of KRR. We then provide a capacity-dependent learning rate and specialize it to four concrete strongly convex losses. In particular, we obtain learning rates both when the logarithmic complexity exponent of the hypothesis RKHS is arbitrarily small and when it is sufficiently large.
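For intuition, here is a minimal Python sketch (not the paper's construction) of the two kernel quantities the abstract refers to: the KRR estimator for the squared loss, the prototypical differentiable strongly convex loss, and an unbiased empirical estimate of the squared MMD. The Gaussian kernel and all parameter values below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2));
    # the kernel width sigma is an illustrative choice.
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return np.exp(-sq / (2.0 * sigma**2))

def krr_fit(X, y, lam=1e-3, sigma=1.0):
    # KRR with the squared loss: by the representer theorem, the minimizer of
    #   (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2   over the RKHS
    # is f = sum_i alpha_i K(x_i, .) with alpha = (K + n * lam * I)^{-1} y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, X_test, alpha, sigma=1.0):
    # Evaluate f(x) = sum_i alpha_i K(x_i, x) at the test points.
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased estimate of the squared maximum mean discrepancy
    #   MMD^2(P, Q) = E k(x, x') + E k(y, y') - 2 E k(x, y)
    # from samples X ~ P and Y ~ Q (diagonal terms removed).
    Kxx, Kyy, Kxy = (gaussian_kernel(A, B, sigma)
                     for A, B in ((X, X), (Y, Y), (X, Y)))
    m, n = X.shape[0], Y.shape[0]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())

# Usage: fit a noisy sine curve, then compare two samples via the MMD.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
alpha = krr_fit(X, y, lam=1e-3, sigma=0.5)
y_hat = krr_predict(X, X, alpha, sigma=0.5)
print(mmd2_unbiased(X, rng.uniform(-3, 3, size=(100, 1)), sigma=0.5))
```

The squared loss is only the simplest member of the class of strongly convex losses treated in the paper; other losses in that class generally do not admit a closed-form solution like the one in krr_fit and must be solved iteratively.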