In this paper, we study the convergence of the gradient descent method for the maximum correntropy criterion (MCC) associated with reproducing kernel Hilbert spaces (RKHSs). MCC is widely used in real-world applications because of its robustness and its ability to handle non-Gaussian impulsive noise. In the regression setting, we show that the gradient descent iterates of MCC can approximate the target function, and we derive a capacity-dependent convergence rate by choosing a suitable number of iterations. Our rate nearly matches the optimal convergence rate established in previous work, and the analysis shows that the scaling parameter is crucial to both the approximation ability and the robustness of MCC. The novelty of our work lies in a sharp estimate for the norms of the gradient descent iterates and in a projection operation applied to the last iterate.
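For concreteness, the iteration under study can be sketched as follows, using the standard formulation of kernel gradient descent for the correntropy-induced loss. The notation below (samples $z = \{(x_i, y_i)\}_{i=1}^{m}$, kernel sections $K_{x_i} = K(\cdot, x_i)$ in the RKHS $\mathcal{H}_K$, step sizes $\eta_t > 0$, and the initialization $f_1 = 0$) follows the related literature and is an assumed convention, not necessarily this paper's exact one.

% Correntropy-induced loss with scaling parameter sigma > 0 and its derivative:
%   ell_sigma(u)  = sigma^2 ( 1 - exp(-u^2/sigma^2) ),
%   ell_sigma'(u) = 2u exp(-u^2/sigma^2).
% One step of kernel gradient descent in the RKHS H_K:
\[
  f_{t+1} \;=\; f_t \;-\; \frac{\eta_t}{m} \sum_{i=1}^{m}
  \ell_\sigma'\bigl( f_t(x_i) - y_i \bigr)\, K_{x_i},
  \qquad f_1 = 0.
\]

Since $|\ell_\sigma'(u)| \le \sqrt{2}\,\sigma e^{-1/2}$ for all $u$, each sample's contribution to the gradient is uniformly bounded, which is the source of the robustness to impulsive noise; conversely, as $\sigma \to \infty$ one has $\ell_\sigma(u) \to u^2$ and the scheme reduces to least squares gradient descent. This is the trade-off between robustness and approximation ability controlled by the scaling parameter.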