Kernel-based maximum correntropy criterion with gradient descent method

In this paper, we study the convergence of the gradient descent method for the maximum correntropy criterion (MCC) associated with reproducing kernel Hilbert spaces (RKHSs). MCC is widely used in many real-world applications because of its robustness and its ability to deal with non-Gaussian impulsive noise. In the regression context, we show that the gradient descent iterates of MCC can approximate the target function, and we derive a capacity-dependent convergence rate by taking a suitable iteration number. Our result nearly matches the optimal convergence rate established in previous work, and it shows that the scaling parameter is crucial to MCC's approximation ability and robustness. The novelty of our work lies in a sharp estimate for the norms of the gradient descent iterates and in the projection operation applied to the last iterate.

1. Introduction. Correntropy [19] is a widely used concept in information theoretic learning (ITL) [18], which serves as a robust nonlinear similarity measure between two random variables. The maximum correntropy criterion (MCC) is a principle induced by correntropy and provides a family of supervised learning algorithms. Recently, MCC has received increasing attention in machine learning and signal processing for its effectiveness when data are disturbed by impulsive noise or contain large outliers [6,7,10,15]. Therefore, it usually serves as a robust alternative to the traditional least squares method or other convex optimization algorithms in learning processes.
In the framework of non-parametric estimation, denote by X the explanatory variable taking values in a separable metric space X, let Y ∈ Y ⊂ R be a real-valued response variable, and let ρ be the underlying distribution on Z := X × Y.
Here we focus on the application of MCC to regression, which is linked to the additive data model
Y = f_ρ(X) + e,
where e is the noise and f_ρ(x) is the regression function, i.e., the conditional mean E(Y|X = x) for given X = x ∈ X. The purpose of regression analysis is to study the functional relationship between X and Y. This leads to estimating the target function f_ρ from a set of samples z = {(x_i, y_i)}_{i=1}^m ⊂ Z := X × Y drawn independently from ρ. For the least squares method, the loss function φ_ls(u) = u², u ∈ R, is employed to measure the quality of an estimator. This method aims to minimize the variance of the error variable and belongs to second-order statistics. Its optimality relies heavily on the assumption of Gaussian distributions. However, non-Gaussian noise is ubiquitous in real-world applications, for instance, artificial noise in electronic devices, atmospheric noise, and lightning spikes in natural phenomena [17,25]. Thus, the least squares method is not a good choice from a robustness point of view when outliers are present in the data. This motivates the application of MCC to regression problems [5,6,18].
For a hypothesis function f : X → Y, with the scaling parameter σ > 0, the correntropy between f(X) and Y is defined by
V_σ(f(X), Y) = E[G((f(X) − Y)²/σ²)],
where G(u) is the Gaussian function exp{−u}, u ∈ R. Given the sample set z, the empirical form of V_σ is
V̂_σ(f) = (1/m) Σ_{i=1}^m G((f(x_i) − y_i)²/σ²).
When applied to regression problems, MCC aims to maximize the empirical correntropy V̂_σ over a hypothesis space H, that is,
f_{z,H} = arg max_{f∈H} V̂_σ(f).   (1.1)
This scheme has succeeded in many real-world applications for its robustness to outliers and heavy-tailed distributions, e.g., wind power forecasting and pattern recognition [2,9]. To better understand MCC in the statistical learning framework, we define the loss induced by correntropy, φ_σ : R → R_+, as
φ_σ(u) = σ²(1 − G(u²/σ²)) = σ²(1 − exp(−u²/σ²)),
where σ > 0 is the scaling parameter. The loss function φ_σ is a variant of the Welsch function [11], and the estimator f_{z,H} of (1.1) is also the minimizer of the following empirical risk minimization scheme over H:
f_{z,H} = arg min_{f∈H} (1/m) Σ_{i=1}^m φ_σ(f(x_i) − y_i).   (1.2)
It should be noted that φ_σ is not convex and satisfies the redescending property: its derivative φ'_σ increases near the origin but decreases toward 0 far from the origin. Various empirical and theoretical studies [4,6,8,13,24] have shown that non-convex losses with the redescending property can yield robust estimators. These works have further confirmed the effectiveness of MCC in dealing with outliers and heavy-tailed distributions. From a computational perspective, the optimization problem arising from MCC is non-convex, and the gradient descent method is a usual way to implement it. The purpose of this paper is to study the convergence performance of the gradient descent estimator for solving the empirical risk minimization scheme (1.2), where an RKHS is taken as the hypothesis space H. The recent work [8] also investigated the asymptotic behavior of the gradient descent algorithm generated by a family of robust loss functions. However, their analysis depends on a specific feature of the linear integral operator associated with the RKHS, which is restrictive in many situations.
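To make the role of σ concrete, here is a minimal numerical sketch of the correntropy-induced loss, assuming the standard Welsch form φ_σ(u) = σ²(1 − exp(−u²/σ²)); the function name is ours.

```python
import numpy as np

def correntropy_loss(u, sigma):
    # phi_sigma(u) = sigma^2 * (1 - exp(-u^2 / sigma^2)):
    # the correntropy-induced (Welsch-type) loss.
    return sigma**2 * (1.0 - np.exp(-(u / sigma)**2))

sigma = 1.0
# Near the origin the loss behaves like the squared loss u^2 ...
print(correntropy_loss(0.01, sigma))   # close to 0.01**2
# ... while for a large residual it saturates at sigma^2, so an
# outlier contributes only a bounded penalty.
print(correntropy_loss(100.0, sigma))  # close to sigma**2 = 1.0
```

In contrast, the squared loss assigns the residual 100 a penalty of 10^4; this boundedness is the source of MCC's robustness.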
In our work, the rigorous error analysis is carried out in terms of the covering number of the RKHS, which has been well studied in the literature. The convergence rate here is capacity-dependent and nearly coincides with the optimal rate for regression problems. We also find that the projected last iterate can be a good estimator of the target function f_ρ; as a byproduct of our iterative technique, we refine the uniform bounds for the norms of the gradient descent iterates obtained in previous work [8,12,14].
The rest of the paper is organized as follows. In Section 2, we introduce some necessary preliminaries and state our main results. Discussions and comparisons with related studies are presented in Section 3. Section 4 is devoted to the proofs of our results.
2. Preliminaries and main results. Before giving our main results, we introduce some necessary preliminaries and notations. Throughout the paper, let ρ_X be the marginal distribution of ρ on X and ρ(·|x) be the conditional distribution on Y for given x ∈ X. Denote by ‖·‖_ρ the norm of the space L²_{ρ_X}, defined by
‖f‖_ρ = (∫_X f²(x) dρ_X)^{1/2}.
Let K : X × X → R be a Mercer kernel [1], i.e., a continuous, symmetric and positive semi-definite function. We say that K is positive semi-definite if, for any finite set {u_1, ..., u_m} ⊂ X and m ∈ N, the matrix (K(u_i, u_j))_{i,j=1}^m is positive semi-definite. The RKHS H_K associated with the Mercer kernel K is defined to be the completion of the linear span of the set of functions {K_x := K(x, ·), x ∈ X} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_u⟩_K = K(x, u). It has the reproducing property
f(x) = ⟨f, K_x⟩_K, for any f ∈ H_K and x ∈ X.   (2.1)
Denote κ := sup_{x∈X} √(K(x, x)). By the property (2.1), we get that
‖f‖_∞ ≤ κ‖f‖_K, for any f ∈ H_K.   (2.2)
We are now ready to state the gradient descent method for solving the scheme (1.2). Note that the loss induced by correntropy, φ_σ, is differentiable at any u ∈ R. Together with the reproducing property (2.1), this allows us to compute the functional gradient of (1.2) in the space H_K and to define the gradient descent algorithm for kernel MCC as follows.
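The defining properties of a Mercer kernel can be checked numerically. The sketch below uses a Gaussian kernel on the real line (our illustrative choice, not a kernel fixed by the paper) and verifies that the Gram matrix on a finite point set is symmetric and positive semi-definite, and that K(x, x) = 1 for this kernel, so κ = 1.

```python
import numpy as np

def gaussian_kernel(u, v, width=0.5):
    # A Mercer kernel on R: continuous, symmetric, positive semi-definite.
    return np.exp(-(u - v)**2 / (2.0 * width**2))

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=8)

# Gram matrix (K(u_i, u_j))_{i,j}: symmetric by construction.
gram = gaussian_kernel(pts[:, None], pts[None, :])

# Positive semi-definiteness: all eigenvalues are nonnegative
# (up to floating-point round-off).
eigs = np.linalg.eigvalsh(gram)
print(np.all(eigs >= -1e-10))        # True
print(gaussian_kernel(0.3, 0.3))     # K(x, x) = 1, so kappa = 1 here
```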
The gradient descent algorithm for MCC regression starts with f_1 = 0 and, for 1 ≤ t ≤ T, updates
f_{t+1} = f_t − (η_t/m) Σ_{i=1}^m φ'_σ(f_t(x_i) − y_i) K_{x_i},   (2.3)
where {η_t > 0, t ∈ N} is the sequence of step sizes and φ'_σ(u) = 2u·exp(−u²/σ²) is the derivative of the loss φ_σ. The purpose of this paper is to investigate the convergence of the iterates {f_t}_{t=1}^{T+1} generated by (2.3) when the iteration number T = T(m) is chosen properly according to the sample size m. Since the algorithm (2.3) addresses a regression task, it is natural to measure its effectiveness by the mean squared error ‖· − f_ρ‖²_ρ or, equivalently, by the excess generalization error with the least squares loss φ_ls.
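A runnable sketch of the iteration (2.3), under our reconstruction of the update with φ'_σ(u) = 2u·exp(−u²/σ²): since f_1 = 0 and each update adds kernel sections K_{x_i}, every iterate lies in span{K_{x_1}, ..., K_{x_m}} and can be tracked by a coefficient vector. The Gaussian kernel, the data, and all parameter values are illustrative choices, not the paper's.

```python
import numpy as np

def mcc_gradient_descent(x, y, sigma=1.0, eta=0.5, theta=0.1, T=300, width=0.5):
    # f_{t+1} = f_t - (eta_t / m) * sum_i phi'_sigma(f_t(x_i) - y_i) K_{x_i},
    # with polynomially decaying step sizes eta_t = eta * t^(-theta).
    # Writing f_t = sum_j alpha_j K_{x_j}, the update acts on alpha.
    m = len(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * width**2))  # Gram matrix
    alpha = np.zeros(m)                                            # f_1 = 0
    for t in range(1, T + 1):
        residual = K @ alpha - y                                   # f_t(x_i) - y_i
        grad = 2.0 * residual * np.exp(-residual**2 / sigma**2)    # phi'_sigma
        alpha -= (eta * t**(-theta) / m) * grad
    return alpha, K

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1.0, 1.0, 60))
y = np.sin(np.pi * x) + 0.05 * rng.normal(size=60)
y[::15] += 5.0                                   # inject impulsive outliers

alpha, K = mcc_gradient_descent(x, y)
fit = K @ alpha
clean = np.delete(np.arange(60), np.arange(0, 60, 15))
print(np.mean((fit[clean] - np.sin(np.pi * x[clean]))**2))  # small
```

Because φ'_σ(u) → 0 for large |u|, the corrupted points contribute almost no gradient, so the fit tracks sin(πx) rather than the outliers.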

TING HU
Definition 2. With the least squares loss φ_ls(u) = u², u ∈ R, the generalization error of any f : X → R is defined by
E(f) = ∫_Z (f(x) − y)² dρ.
Obviously, the target function f_ρ is the minimizer of E(f) over the space of measurable functions, that is,
f_ρ = arg min_f E(f).
The excess generalization error of any f : X → R is defined by E(f) − E(f_ρ), and it is direct to calculate that
E(f) − E(f_ρ) = ‖f − f_ρ‖²_ρ.   (2.4)
Recall that |Y| ≤ B almost surely. It implies that f_ρ takes values in [−B, B] almost surely. So, we will truncate the last iterate f_{T+1} so that its values lie in [−B, B] before using it to approximate f_ρ. To this end, we make full use of the projection method introduced in [3].
Definition 3. For any f : X → R, the projected function f̂ is defined by
f̂(x) = min{max{f(x), −B}, B}, x ∈ X,
that is, f̂ truncates f to the interval [−B, B]. It is easy to see that for any fixed x ∈ X and y ∈ Y, |f̂(x) − y| ≤ |f(x) − y|. This confirms that the projection operation leads to better estimators. The same idea can also be found in [23,26]. Thus, in what follows, our approximation analysis aims at establishing bounds for ‖f̂_{T+1} − f_ρ‖²_ρ. Our convergence results will be stated under some conditions on the triplet (H_K, f_ρ, ρ_X). The first condition is about the decay of the approximation error [20].
Assumption 1. Let λ > 0 and let f_λ be the minimizer of the regularization error:
f_λ = arg min_{f∈H_K} { E(f) − E(f_ρ) + λ‖f‖²_K }.
The approximation error associated with the triplet (H_K, f_ρ, ρ_X) is defined by
D(λ) = E(f_λ) − E(f_ρ) + λ‖f_λ‖²_K.
We assume that for some β ∈ (0, 1] and c_β > 0, there holds
D(λ) ≤ c_β λ^β, for any λ > 0.   (2.8)
The above assumption is common in the learning theory literature. The bound (2.8) always holds with β = 0. When the target function f_ρ ∈ H_K, or when H_K is dense in the space C(X) of bounded continuous functions on X, the approximation error D(λ) goes to 0 as λ → 0. Thus, the decay condition (2.8) is natural and can be characterized in terms of interpolation spaces [20].
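The truncation in Definition 3 is a one-line operation in practice. The sketch below (names ours, B = 1 for illustration) checks the pointwise inequality |f̂(x) − y| ≤ |f(x) − y| that makes the projection harmless.

```python
import numpy as np

def project(f_values, B=1.0):
    # The projected function: truncate f to [-B, B].  Since |y| <= B
    # almost surely, moving f(x) onto the interval can only decrease
    # the pointwise error |f(x) - y|.
    return np.clip(f_values, -B, B)

f_vals = np.array([-3.0, -0.5, 0.2, 4.0])
y = 0.8                                   # any response value in [-B, B]
print(project(f_vals))
print(np.abs(project(f_vals) - y) <= np.abs(f_vals - y))  # all True
```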
The second condition is related to the capacity of the hypothesis space H_K, measured by empirical covering numbers of balls in the RKHS.
Assumption 2. Let G be a set of functions on X. The metric d_{2,z} on G is defined by
d_{2,z}(f, g) = ((1/m) Σ_{i=1}^m (f(x_i) − g(x_i))²)^{1/2}, f, g ∈ G.
We assume that for some ζ ∈ (0, 2) and c_ζ > 0, the covering number of the unit ball B_1 := {f ∈ H_K : ‖f‖_K ≤ 1} satisfies
log N(B_1, ε, d_{2,z}) ≤ c_ζ ε^{−ζ}, for any ε > 0.
Besides empirical covering numbers, there are other standard tools to measure the complexity of the RKHS: uniform covering numbers (covering numbers of G under the metric ‖·‖_∞), entropy numbers, and the effective dimension associated with the integral operator L_K defined by
L_K f = ∫_X f(u) K(·, u) dρ_X(u), f ∈ L²_{ρ_X}.
For a discussion of their equivalence, we refer to the papers [23,14]. The analysis of this paper is mainly achieved by a concentration technique that involves the empirical covering number and often leads to sharp error bounds [26].
Our first theorem below provides a novel bound for the iterates {f_t}_{t=1}^{T+1} in the H_K-norm, which improves previous uniform bounds for the iterates generated by various gradient descent algorithms. Without loss of generality, we let κ = 1 for simplicity.
Taking λ = T^{−(1−θ)}, the resulting upper bound for sup_{1≤t≤T} ‖f_{t+1}‖_K (ignoring the logarithmic factor) improves on the uniform bound O(T^{(1−θ)/2}) in the existing literature [8,14,12] and will help us get a sharper estimate for the gradient descent method.
From this theorem, we see that the value of the scaling parameter σ affects the convergence rate. A relatively large σ leads to a rate of order m^{−min{2β/(2β+ζ)−ε, β}}, which nearly matches the best capacity-dependent rate for regression problems up to an arbitrarily small ε > 0 ([26] and p. 268, [23]). In particular, when f_ρ ∈ H_K, i.e., β = 1, the rate is O(m^{−2/(2+ζ)+ε}). To further illustrate our result, we give the following example, in which the regression function f_ρ lies in a certain range space.
Example 1. Suppose that Assumption 2 holds with ζ ∈ (0, 2] and that f_ρ ∈ L_K^{β/2}(L²_{ρ_X}) with 0 < β ≤ 1; then the above convergence rate holds for any ε > 0. Due to the compactness of X, the integral operator L_K : L²_{ρ_X} → L²_{ρ_X} induced by the Mercer kernel K is self-adjoint, positive and trace class on L²_{ρ_X}, as is its restriction to H_K. By the spectral theorem, the range space L_K^{β/2}(L²_{ρ_X}) is well defined. The index β measures the regularity of f_ρ: the bigger the index β, the higher the regularity of f_ρ. As expected, the obtained convergence rate improves as β increases.
3. Discussions and comparison with related work. In this section, we discuss and compare our results with related work. In the paper [6], Feng et al. studied the consistency of the empirical scheme (1.2) over a compact subspace H ⊂ C(X) and obtained an explicit convergence rate under some complexity conditions on H. In the case of f_ρ ∈ H_K, under Assumption 2, their rate is of order O(m^{−2/(2+ζ)}) with a suitably chosen σ, which nearly coincides with our result in (2.13) up to an arbitrarily small ε. When f_ρ is outside H_K, by tracing their proof of Theorem 4 carefully, we find that their rate, which holds with confidence 1 − δ, involves a power index p of the polynomial growth condition on the logarithm of the uniform covering number N(B_1, ε, C(X)). Let R = ‖f_λ‖_K, where the regularization parameter λ gives the trade-off among the error terms above. Note that for any function set F ⊂ C(X), the empirical covering number N(F, ε, d_{2,z}) is bounded by the uniform covering number N(F, ε, C(X)). Hence our rate O(m^{−min{2β/(2β+ζ)−ε, β}}) in Theorem 1 is much better when σ is chosen properly.
Denote by {λ_i}_i the set of positive eigenvalues of L_K : L²_{ρ_X} → L²_{ρ_X}, arranged in decreasing order. Under the assumptions that the eigenvalues satisfy λ_i ≤ c·i^{−2/ζ} for some c > 0 and ζ ∈ (0, 2) and that |Y| ≤ B almost surely, Guo et al. [8] investigated the performance of the last iterates generated by the gradient descent method associated with a family of robust losses. Applying their results to the loss φ_σ induced by correntropy, if f_ρ ∈ L_K^{β/2}(L²_{ρ_X}) with 0 < β ≤ 1, then with confidence at least 1 − δ, their rate is at most of order O(m^{−β/(1+ζ)}). According to the equivalence between the decay rate of the eigenvalues {λ_i}_i and the empirical covering numbers, as shown in [14,22,23], Assumption 2 holds with the same ζ if λ_i = O(i^{−2/ζ}). So their rate is worse than that in (2.14). In addition, it should be mentioned that their work was carried out under the assumption that the target function f_ρ lies in the range of some power of the integral operator L_K. This assumption is strong for many commonly used RKHSs. For example, if K is Gaussian, it requires f_ρ ∈ C^∞, which is restrictive for real problems. The conditions in this paper are more general.
As mentioned in Section 1, MCC is usually taken as a good alternative to the least squares method. In what follows, we compare our results with previous work on least squares. We first review the results on Tikhonov regularization with the least squares loss, i.e.,
f_{z,λ} = arg min_{f∈H_K} { (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + λ‖f‖²_K }.
For a subset G of a metric space (H, d), the n-th entropy number e_n(G, d) is defined as the infimum of ε > 0 such that G can be covered by 2^{n−1} balls of radius ε in (H, d). Note that covering numbers and entropy numbers are equivalent (see [23], Lemma 6.21). Replacing the decay assumption on E_z[e_n(B_1, d_{2,z})] with Assumption 2, the rate for ‖f_{z,λ} − f_ρ‖²_ρ still holds with the same ζ, that is, it is of order O(m^{−min{2β/(2β+ζ), β}}). This has been confirmed by the work [26], which studied Tikhonov regularization with multi-kernels. To our knowledge, this is the best known rate for Tikhonov regularized empirical risk minimization with the least squares loss when no extra regularity condition is imposed on f_ρ. Our result in (2.13) nearly matches it, up to an arbitrarily small ε. Next we turn to the iterative regularization scheme for the least squares loss [14], in which no constraint or penalization term is considered and the number of iterations serves as a regularization parameter. Like the algorithm (2.3) in our paper, it can be regarded as a gradient descent method for implementing empirical risk minimization with the least squares loss. Under Assumptions 1 and 2 and |Y| ≤ B, the error rate proved in [14] for the last iterate of this scheme is obviously inferior to our rate in (2.14).
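For comparison, the Tikhonov regularized least squares scheme reviewed above has a closed-form solution via the representer theorem, which makes it easy to sketch; the Gaussian kernel and all numbers below are our illustrative choices, not those of [26].

```python
import numpy as np

def kernel_ridge(x, y, lam, width=0.5):
    # Tikhonov regularization with the least squares loss over H_K:
    #   f_{z,lam} = argmin_f (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2.
    # The minimizer is f = sum_j alpha_j K_{x_j}, where the coefficient
    # vector solves (K + m * lam * I) alpha = y.
    m = len(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * width**2))
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    return alpha, K

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1.0, 1.0, 50))
y = np.sin(np.pi * x) + 0.05 * rng.normal(size=50)

alpha, K = kernel_ridge(x, y, lam=1e-3)
print(np.mean((K @ alpha - np.sin(np.pi * x))**2))  # small
```

Unlike the MCC iteration, this estimator is linear in y, so a single large outlier shifts the fit proportionally; that sensitivity is the robustness gap motivating MCC.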
At the end of this section, we would like to point out that, although our analysis is carried out with a relatively large σ, the established convergence rates indicate that implementing MCC with the gradient descent method is feasible in regression problems. It has been empirically reported that the robustness of the regression models induced by φ_σ is enhanced as σ decreases. Thus, the scaling parameter σ balances the robustness of the MCC algorithm (2.3) against its convergence rates, playing a trade-off role in MCC's applications to regression problems.

4. Error estimations and proofs. In this section, we prove the results presented in Section 2. Let us start with the error decomposition for the algorithm (2.3). By (2.4), estimating ‖f̂_{T+1} − f_ρ‖²_ρ is equivalent to bounding the excess generalization error E(f̂_{T+1}) − E(f_ρ), which is split as follows.

4.1. Error decomposition.
E(f̂_{T+1}) − E(f_ρ) ≤ { E(f̂_{T+1}) − E(f_ρ) − [E_z(f̂_{T+1}) − E_z(f_ρ)] } + { E_z(f_{T+1}) − E_z(f_ρ) },   (4.1)
where we have used E_z(f̂_{T+1}) ≤ E_z(f_{T+1}), and E_z(f) is the empirical version of the generalization error E(f) over the sample z, given by
E_z(f) = (1/m) Σ_{i=1}^m (f(x_i) − y_i)².
The first term on the right-hand side of (4.1) is referred to as the sample error; it quantifies the discrepancy between the excess generalization error of the estimator and its empirical counterpart. It can be studied with empirical process theory, where various uniform laws of large numbers are available to deal with it. Noting that f_{T+1} depends on the sample z, we shall bound the sample error uniformly over the function set
B_R := {f ∈ H_K : ‖f‖_K ≤ R},
where R denotes an upper bound for the H_K-norm of f_{T+1}, i.e., ‖f_{T+1}‖_K ≤ R.
Here we are able to get a sharp sample error estimate via the projection operation and a concentration inequality. The second term E_z(f_{T+1}) − E_z(f_ρ) is called the computation error; it is related to the optimization procedure and plays an essential role in our analysis. To obtain a good computation error estimate, we introduce a reference function f* ∈ H_K and write
E_z(f_{T+1}) − E_z(f_ρ) = [E_z(f_{T+1}) − E_z(f*)] + [E_z(f*) − E(f*)] + [E(f*) − E(f_ρ)] + [E(f_ρ) − E_z(f_ρ)].
To simplify the expression, we denote the approximation error A(f*) := E(f*) − E(f_ρ), write F_z(f*) for the sum of the two sampling terms [E_z(f*) − E(f*)] + [E(f_ρ) − E_z(f_ρ)], and let M_z(R) denote the sample error term in (4.1) taken uniformly over B_R. Collecting the above observations, we have the following lemma.

Lemma 1.
Suppose the bound ‖f_{T+1}‖_K ≤ R holds for some R > 0. Then for any given f* ∈ H_K, we have
E(f̂_{T+1}) − E(f_ρ) ≤ M_z(R) + [E_z(f_{T+1}) − E_z(f*)] + F_z(f*) + A(f*).   (4.3)
From this lemma, we see that the best iteration number and the corresponding convergence rate are achieved by suitably balancing the various error terms in the above decomposition. In the next subsections, we estimate them term by term.

4.2.
Computation error for f*. In this subsection, we estimate the computation error E_z(f_{T+1}) − E_z(f*) for an arbitrary f* ∈ H_K. To this end, we introduce the following lemma, which gives an upper bound for the iterates {f_t}_{t=1}^{T+1} in the H_K-norm. Its proof is similar to that in [8] for bounding the iterates of gradient descent algorithms with robust losses, so we omit it for simplicity. Let {f_t} be generated by (2.3). If the step sizes satisfy 0 < η_t < 1 for any t ≥ 1, then the H_K-norms of the iterates grow at most like (Σ_{j=1}^t η_j)^{1/2}; in particular, if η_t = ηt^{−θ} with 0 < η ≤ 1 and 0 ≤ θ < 1, then sup_{1≤t≤T} ‖f_{t+1}‖_K = O(T^{(1−θ)/2}). The estimation of the computation error is based on some basic properties of quadratic functions, which will be used in the proof.
where C is a constant independent of T and will be given in the proof.
Proof. Note that b² = a² + 2a(b − a) + (b − a)² for any a, b ∈ R. Take b = f_{t+1}(x_i) − y_i and a = f_t(x_i) − y_i, and apply the property (2.1). Averaging the resulting equality over i = 1, ..., m and using (2.2) gives the key identity for the proof. The algorithm (2.3) can be rewritten as f_{t+1} = f_t − η_t W_t. Putting this into (4.7) yields the desired bound, where the last inequality is derived from 0 < η_t ≤ 1. Now we consider the term ⟨f_{t+1} − f_t, W_t − V_t⟩_K. Since the Gaussian function G(u) is Lipschitz, for any u ∈ R,

This together with (4.5) yields (4.8). According to this inequality, it is not difficult to obtain a corresponding estimate for any T > t. Introducing that estimate into (4.8) and summing up over t = 1, ..., T with f_1 = 0, we obtain a bound on the accumulated error. For the term ⟨f_t − f*, V_t − W_t⟩_K, we use (4.5) and the Lipschitz property of G again; plugging (4.9) into the resulting relation, we arrive at (4.13). Dividing both sides of (4.13) by Σ_{t=1}^T η_t, we obtain

Noting the inequality Σ_{t=1}^T t^{−θ} ≥ ∫_1^{T+1} x^{−θ} dx = ((T+1)^{1−θ} − 1)/(1 − θ), we can then get the bound (4.6) based on the above estimate. The proof is completed.

4.3.
Estimations involving empirical process theory. In this subsection, we proceed to estimate the error terms M_z(R) and F_z(f*) in (4.3), which depend on the random sample and can be studied with empirical process theory. To this end, we need some probability inequalities. The first one is a concentration inequality for random variables in a family of functions, which can be found in [14,26]. The second is a variant of the one-sided Bernstein inequality for a single random variable, whose proof will be provided in the Appendix.
Lemma 5. Let ξ be a random variable on Z and let M, c > 0 and τ ∈ [0, 1] be constants such that ‖ξ‖_∞ ≤ M and Eξ² ≤ c(Eξ)^τ. Then for any 0 < δ < 1, with confidence at least 1 − δ, the stated one-sided Bernstein-type bound holds. With the above preparations, we are ready to estimate M_z(R).
Lemma 6. Suppose Assumptions 1 and 2 hold. For any f ∈ B R , R ≥ 1, with confidence at least 1 − δ, where the constant C 1 is given in the proof.
Proof. We apply Lemma 4 to the function set G := {(f̂(x) − y)² − (f_ρ(x) − y)² : f ∈ B_R}. Recalling that ‖f_ρ‖_∞ ≤ B and |Y| ≤ B, we can bound the range and variance of the functions in G, and the covering number condition holds with a = c_ζ(4BR)^ζ. Thus all the conditions of Lemma 4 hold, and with confidence at least 1 − δ, the stated concentration bound holds for all g ∈ G. So we get that, with confidence at least 1 − δ, the bound holds for all f ∈ B_R, which yields the desired conclusion (4.15) with C_1 = c_ζ C_1' + 176B². The proof is completed.
Next, we move on to the estimation of the term F_z(f*). Observing the expression of F_z(f*), in order to get a better estimate from Lemma 5, we want to choose a reference function f* in H_K with a small generalization error. A natural choice is f* = f_λ, the minimizer of the regularized generalization error with a free regularization parameter λ. With this choice, the last error term in (4.3) can be controlled by the approximation error, A(f_λ) ≤ D(λ). Keeping this in mind, we also give the estimation of F_z(f_λ) for any λ > 0 as follows.
Lemma 7. Suppose Assumption 1 holds. Let f* ∈ H_K with ‖f*‖_K ≤ R̃ for some constant R̃ ≥ 0. Then for any 0 < δ < 1, with confidence at least 1 − δ, the stated bound for F_z(f*) holds; in particular, taking f* = f_λ, we obtain the corresponding bound with confidence at least 1 − δ, where C_2 and C_2' are constants independent of m, δ, R̃ and λ, and will be given in the proof.

4.4.
Improving the bounds for iterates. Based on the above estimations for the error terms in (4.3), with the choice f* = f_λ, we know that a tight estimate for the H_K-norm of the last iterate f_{T+1} will lead to a better error bound for the algorithm (2.3). We shall prove the result in Theorem 1 and obtain the refined bound by applying the following proposition iteratively.
Proposition 1. Assume that there exists a constant R > 1 such that ‖f_{t+1}‖_K ≤ R for each t = 1, ..., T. If the step sizes take the form η_t = ηt^{−θ} with η > 0 and 0 ≤ θ < 1, then for any f* ∈ H_K and all 1 ≤ t ≤ T, with confidence at least 1 − δ, the bound (4.20) holds, where the constant C_1 is given in (4.15) and Δ_R, which depends on R and ‖f*‖_K, is given in the proof.
We observe that the estimates of F_z(f*), A(f*) and Δ_t in (4.20) depend heavily on the choice of f*. Thus, this proposition asserts that, for suitable step sizes and scaling parameter, the sequence ‖f_{t+1}‖_K, t = 1, ..., T, can be well bounded whenever there exists some f* = f_λ such that the second term on the right-hand side of (4.20) is small. So we shall choose a suitable parameter λ to prove Theorem 1.