Error analysis on regularized regression based on the maximum correntropy criterion

This paper studies the regularized learning algorithm for regression associated with the correntropy induced loss in reproducing kernel Hilbert spaces. The main goal is an error analysis for the regression problem in learning theory based on the maximum correntropy criterion. Explicit learning rates are provided. Our analysis shows that, with a suitable choice of the scale parameter of the loss function, satisfactory learning rates are obtained. The rates depend on the regularization error and on the covering numbers of the reproducing kernel Hilbert space.

1. Introduction. The regression problem can be traced back to linear regression with least squares. We begin with an example that explains what regression is and leads to the mathematical model of regression in learning theory. Assume the underlying law can be expressed as a function $f : \mathbb{R} \to \mathbb{R}$ of the specific form $f_\alpha(x) = \sum_{i=1}^N \alpha_i \psi_i(x)$, where the $\psi_i$ are elements of a basis of a specific function space. We need to learn the coefficients $\alpha := (\alpha_1, \alpha_2, \cdots, \alpha_N)$ from a set of data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$. If the measurements generating this set are exact, then $y_i = f(x_i)$, $i = 1, \ldots, m$. In general, however, the values $y_i$ are affected by noise, so one computes the vector of coefficients $\alpha$ minimizing the empirical squared error $\sum_{i=1}^m (f_\alpha(x_i) - y_i)^2$. This is the classical least squares regression method, which goes back to Gauss and Legendre (see [2]).
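As a small numerical illustration of this classical procedure (a minimal sketch; the basis, the data and all names below are illustrative choices, not taken from the paper):

```python
import numpy as np

# illustrative basis psi_1, ..., psi_N: here the monomials 1, x, x^2
def design_matrix(x, N=3):
    return np.vstack([x ** i for i in range(N)]).T

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.1 * rng.standard_normal(30)  # noisy values y_i

Psi = design_matrix(x)
# coefficients alpha minimizing sum_i (f_alpha(x_i) - y_i)^2
alpha, *_ = np.linalg.lstsq(Psi, y, rcond=None)
print("fitted coefficients:", alpha)
```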
Generally, we assume that $\mathcal{X}$ is a compact metric space and $\mathcal{Y} \subset \mathbb{R}$. Let $\rho$ be a probability distribution on $Z := \mathcal{X} \times \mathcal{Y}$, which is usually unknown. The generalization error ([2]) of $f : \mathcal{X} \to \mathcal{Y}$ is defined as
$$\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho,$$
where $(f(x) - y)^2$ is the error suffered from the use of $f$ as a model for the process producing $y$ from $x$, for each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.
Let $\rho(y|x)$ be the conditional probability distribution on $\mathcal{Y}$ and $\rho_X$ be the marginal probability distribution of $\rho$ on $\mathcal{X}$. Define $f_\rho : \mathcal{X} \to \mathcal{Y}$ by
$$f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|x), \qquad x \in \mathcal{X}.$$
The function $f_\rho$ is called the regression function; it minimizes the generalization error.
Let $z = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in Z^m$ be a sample in $Z^m$. The empirical error of $f$ is defined as
$$\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2.$$
In this paper, we consider the general model of regression
$$Y = f(X) + \epsilon, \qquad \mathbb{E}(\epsilon \mid X) = 0, \qquad (1.1)$$
where $X$ is the explanatory variable taking values in $\mathcal{X}$ and $Y \in \mathcal{Y}$ is the response variable. From (1.1), it is easy to see that almost surely there holds $f_\rho = f$. The target of the regression problem is to find a good approximation of the regression function from random samples. The study of regression problems can be found in a large body of learning theory literature (see [20], [21], [12] and the references therein).

As we know, the most frequently employed methodology for quantifying regression efficiency is the mean squared error. This classical tool minimizes the variance of $f(x) - y$ and belongs to second-order statistics (see [2]). The drawback of second-order statistics is that its optimality depends heavily on the assumption of Gaussian noise. However, in many practical applications, data may be contaminated by non-Gaussian noise or outliers. To address this kind of problem, Hu et al. ([10] and [11]) presented thorough studies of the minimum error entropy criterion for regression (MEECR) from a learning theory viewpoint, and gave the first results concerning regression consistency and convergence rates for MEECR. This also motivates the introduction of the maximum correntropy criterion into regression problems [5]. Recently, Feng and Ying ([6]) investigated the behavior of correntropy based regression in the presence of outliers by using the Fourier analysis technique developed in Fan et al. ([3]) and by modeling outliers with Huber's contamination model. It was also demonstrated in Feng et al. ([4]) that correntropy based regression essentially regresses towards the conditional mode function when the scale parameter diminishes towards zero. Meanwhile, Lv and Fan ([14]) provided optimal learning rates with Gaussian kernels and the correntropy loss.
A generalized correlation function named correntropy ([15]) is a generalized similarity measure between two scalar random variables $T_1$ and $T_2$, defined by
$$V_\sigma(T_1, T_2) = \mathbb{E}\big[K_\sigma(T_1, T_2)\big],$$
where $\mathbb{E}(\cdot)$ denotes mathematical expectation, $K_\sigma(t_1, t_2) = \exp\{-(t_1 - t_2)^2/\sigma^2\}$ is a Gaussian kernel with scale parameter $\sigma > 0$, and $(t_1, t_2)$ is a realization of $(T_1, T_2)$. When the correntropy $V_\sigma$ is used in regression problems, we call it the maximum correntropy criterion for regression (MCCR). The correntropy involves higher-order moments of the probability density function and can be applied to non-Gaussian regression problems. Recently, this kind of generalized correlation function has drawn much attention in the signal processing and machine learning communities. In [13], Liu et al. used the correntropy in non-Gaussian signal processing; the new method outperforms MSE (mean squared error) in the case of impulsive noise, since correntropy is inherently insensitive to outliers. In [7, 8], He et al. presented a sparse correntropy framework for computing robust sparse representations of face images for recognition, improving both recognition accuracy and receiver operating characteristic curves. Feng et al. [5] gave a theoretical understanding of the maximum correntropy criterion for regression (MCCR) within the statistical learning framework. The maximum correntropy criterion has also succeeded in some real-world applications (see [1]). We recall the definition of the correntropy loss function from [5].
Definition 1. The correntropy induced regression loss $l_\sigma : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is defined as
$$l_\sigma(y, t) = \sigma^2\left(1 - \exp\left\{-\frac{(y - t)^2}{\sigma^2}\right\}\right), \qquad y \in \mathcal{Y},\ t \in \mathbb{R},$$
with $\sigma > 0$ being a scale parameter.
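For intuition, a minimal numerical sketch of this loss (the function name and the sample residuals are illustrative): it behaves like the squared loss for residuals that are small relative to $\sigma$, while saturating at $\sigma^2$ for large residuals, which is the source of its robustness to outliers.

```python
import numpy as np

def correntropy_loss(y, t, sigma):
    """l_sigma(y, t) = sigma^2 * (1 - exp(-(y - t)^2 / sigma^2))."""
    return sigma ** 2 * (1.0 - np.exp(-(y - t) ** 2 / sigma ** 2))

residuals = np.array([0.1, 1.0, 10.0])          # values of y - f(x)
for sigma in (0.5, 5.0, 50.0):
    print(sigma, correntropy_loss(residuals, 0.0, sigma))
# small |y - t|: loss ~ (y - t)^2;  large |y - t|: loss saturates near sigma^2
```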
Definition 2. The risk functional for MCCR algorithms is given by
$$\mathcal{E}^\sigma(f) = \int_Z l_\sigma(y, f(x)) \, d\rho.$$
The corresponding empirical risk on a set of observations $z$ is defined as
$$\mathcal{E}^\sigma_z(f) = \frac{1}{m}\sum_{i=1}^m l_\sigma(y_i, f(x_i)).$$
We also need the following concepts about covering numbers [2].

Definition 3. The covering number of the hypothesis space $\mathcal{H}$, denoted $\mathcal{N}(\mathcal{H}, \eta)$, with radius $\eta > 0$, is defined as
$$\mathcal{N}(\mathcal{H}, \eta) = \min\Big\{l \in \mathbb{N} : \exists\, f_1, \ldots, f_l \in \mathcal{H} \text{ such that } \mathcal{H} \subset \bigcup_{j=1}^l B(f_j, \eta)\Big\},$$
where $B(f, \eta) = \{g \in \mathcal{H} : \|f - g\|_\infty \le \eta\}$ denotes the closed ball in $C(\mathcal{X})$ with center $f \in \mathcal{H}$ and radius $\eta$.
Definition 4. Let $\mathbf{x} = \{x_1, \ldots, x_m\} \subset \mathcal{X}^m$. The $l^2$-empirical covering number of the hypothesis space $\mathcal{H}$, denoted $\mathcal{N}_2(\mathcal{H}, \eta)$, with radius $\eta > 0$, is defined by
$$\mathcal{N}_2(\mathcal{H}, \eta) = \sup_{m \in \mathbb{N}}\ \sup_{\mathbf{x} \in \mathcal{X}^m} \mathcal{N}_{2,\mathbf{x}}(\mathcal{H}, \eta),$$
where $\mathcal{N}_{2,\mathbf{x}}(\mathcal{H}, \eta)$ is the covering number of $\mathcal{H}$ with respect to the empirical metric $d_{2,\mathbf{x}}(f, g) = \big(\frac{1}{m}\sum_{i=1}^m (f(x_i) - g(x_i))^2\big)^{1/2}$.

By minimizing the empirical risk over a set of continuous functions $\mathcal{H}$, we obtain a predictor $f_z$ as
$$f_z := \arg\min_{f \in \mathcal{H}} \mathcal{E}^\sigma_z(f). \qquad (1.2)$$
Naturally, we want to know the performance of this predictor. For example, in least squares regression the predictive power of a function $f$ is measured by the mean squared error $\mathcal{E}(f) = \int_Z (y - f(x))^2 \, d\rho$. From the definition of the regression function, we know that $f_\rho$ minimizes the mean squared error. So we can also estimate the excess mean squared error, i.e.,
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}},$$
to measure the goodness of a predictor $f$.
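Since $l_\sigma$ is non-convex, the minimizer $f_z$ is typically computed numerically. A minimal sketch (hypothetical data and a plain gradient descent over basis coefficients; the paper analyzes the minimizer itself rather than any particular solver):

```python
import numpy as np

def empirical_mccr_risk(alpha, Psi, y, sigma):
    """(1/m) * sum_i sigma^2 * (1 - exp(-(y_i - f_alpha(x_i))^2 / sigma^2))."""
    r = y - Psi @ alpha
    return np.mean(sigma ** 2 * (1.0 - np.exp(-r ** 2 / sigma ** 2)))

def mccr_gradient(alpha, Psi, y, sigma):
    r = y - Psi @ alpha
    # the factor exp(-r^2/sigma^2) damps the contribution of large residuals (outliers)
    return -(2.0 / len(y)) * Psi.T @ (np.exp(-r ** 2 / sigma ** 2) * r)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 60)
Psi = np.vstack([x ** i for i in range(3)]).T
y = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(60)
y[:4] += 5.0                                   # a few outliers

alpha = np.zeros(3)
for _ in range(5000):                          # plain gradient descent on the non-convex risk
    alpha -= 0.1 * mccr_gradient(alpha, Psi, y, sigma=1.0)
print("empirical risk:", empirical_mccr_risk(alpha, Psi, y, 1.0), "coefficients:", alpha)
```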

However, existing theoretical results on the loss $l_\sigma$ and the MCCR model are very limited. The reason mainly lies in the non-convexity of the loss function $l_\sigma$. Feng et al. [5] presented consistency properties and convergence rates of the MCCR model (1.2) under the assumption that $\mathcal{H}$ is a compact subset of $C(\mathcal{X})$. By choosing suitable parameters $\sigma$, they gave learning rates bounding the difference between $f_z$ and the target function $f$. They obtained the following main results [5].
where $C_{\mathcal{H},\rho}$ is a positive constant independent of $m$, $\sigma$ or $\delta$.
Notice that the MCCR model (1.2) is a constrained optimization model, since $\mathcal{H}$ is assumed to be a compact subset of $C(\mathcal{X})$. Generally, a typical choice of $\mathcal{H}$ is a bounded subset of a certain reproducing kernel Hilbert space $\mathcal{H}_K$ induced by some Mercer kernel $K$. In this paper, instead of the constrained optimization model (1.2), we focus on the unconstrained version (1.3) below. That is to say, we study the regularized MCCR algorithm based on minimizing $\mathcal{E}^\sigma_z$ plus a penalty term. We want to emphasize that in [5] the authors provided some numerical experiments using the regularized regression model (1.3) (e.g., the noisy sinc function and the noisy Friedman benchmark function). The experiments verified that the regularized MCCR model gave the best fitting results, especially at positions where the data are corrupted by outliers. This indicates that it is very necessary to establish an error analysis theory for the regularized MCCR algorithm based on minimizing $\mathcal{E}^\sigma_z$.

For this purpose, we need the definition of reproducing kernel Hilbert spaces. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be continuous, symmetric and positive semi-definite, i.e., for any finite set of distinct points $\{x_1, \ldots, x_l\} \subset \mathcal{X}$, the matrix $\big(K(x_i, x_j)\big)_{i,j=1}^l$ is positive semi-definite. Such a function is called a Mercer kernel. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the closure of the span of $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$ with respect to the inner product given by $\langle K_x, K_y\rangle_K = K(x, y)$; it satisfies the reproducing property $f(x) = \langle f, K_x\rangle_K$ for all $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$.

Definition 5. The regularized MCCR algorithm in an RKHS $\mathcal{H}_K$ is defined as
$$f_{z,\lambda} := \arg\min_{f \in \mathcal{H}_K} \Big\{ \mathcal{E}^\sigma_z(f) + \lambda \|f\|_K^2 \Big\}, \qquad (1.3)$$
where $\lambda > 0$ is a regularization parameter.
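To make the algorithm concrete, here is a minimal sketch of one common practical strategy for such non-convex correntropy objectives, namely half-quadratic (iteratively reweighted) optimization over representer coefficients; the kernel, the data and the solver are illustrative assumptions and are not prescribed by the paper, which studies the minimizer $f_{z,\lambda}$ itself.

```python
import numpy as np

def gaussian_gram(x, width=0.5):
    """Gram matrix of the Mercer kernel K(s, t) = exp(-(s - t)^2 / (2 * width^2))."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def regularized_mccr(x, y, sigma=0.5, lam=1e-3, width=0.5, iters=50):
    """Approximate f_{z,lambda} with f = sum_j alpha_j K(x_j, .) by alternating
    correntropy reweighting and a weighted regularized least squares step."""
    m = len(x)
    K = gaussian_gram(x, width)
    alpha = np.zeros(m)
    for _ in range(iters):
        residual = y - K @ alpha
        w = np.exp(-residual ** 2 / sigma ** 2)          # outliers receive small weights
        A = K @ (w[:, None] * K) / m + lam * K + 1e-10 * np.eye(m)
        b = K @ (w * y) / m
        alpha = np.linalg.solve(A, b)
    return alpha, K

# illustrative data: noisy sinc with outliers, in the spirit of the experiments in [5]
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 80)
y = np.sinc(x) + 0.05 * rng.standard_normal(80)
y[:5] += 3.0
alpha, K = regularized_mccr(x, y)
print("first few fitted values:", (K @ alpha)[:3])
```

Each pass recomputes the correntropy weights $w_i = \exp\{-(y_i - f(x_i))^2/\sigma^2\}$ and then solves a weighted, regularized least squares problem; as $\sigma \to \infty$ all weights tend to $1$ and the iteration reduces to ordinary kernel ridge regression.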
To obtain the main results, we need some assumptions. Denote $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. Throughout this paper, we assume that for some $q > 0$ and $C_q > 0$, the covering number of $B_1$ satisfies
$$\log \mathcal{N}(B_1, \eta) \le C_q \left(\frac{1}{\eta}\right)^q, \qquad \forall\, \eta > 0. \qquad (1.4)$$
Such capacity conditions have been extensively studied in learning theory ([24, 25]); in particular, for a $C^\infty$ kernel, (1.4) holds for any $q > 0$.
Definition 6. The regularization error $D(\lambda)$ is defined as
$$D(\lambda) = \min_{f \in \mathcal{H}_K}\Big\{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda\|f\|_K^2\Big\},$$
where the regularizing function is
$$f_\lambda := \arg\min_{f \in \mathcal{H}_K}\Big\{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda\|f\|_K^2\Big\}.$$
The regularization algorithm has been well understood in [16] and [12]. We shall assume that for some constants $0 < \beta \le 1$ and $C_\beta > 0$,
$$D(\lambda) \le C_\beta \lambda^\beta, \qquad \forall\, \lambda > 0. \qquad (1.5)$$
Throughout the paper we also assume that for a constant $M > 0$, there holds $|Y| \le M$ almost surely. It follows that $\|f_\rho\|_{L^2_{\rho_X}} \le M$.
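As a simple worked special case of (1.5) (a sketch under the definition of $D(\lambda)$ above): if $f_\rho$ itself belongs to $\mathcal{H}_K$, then taking $f = f_\rho$ in the minimum gives
$$D(\lambda) \le \mathcal{E}(f_\rho) - \mathcal{E}(f_\rho) + \lambda\|f_\rho\|_K^2 = \lambda\|f_\rho\|_K^2,$$
so the approximation condition (1.5) holds with $\beta = 1$ and $C_\beta = \|f_\rho\|_K^2$.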
Our main results are stated as follows.

Theorem 1. Assume the capacity condition (1.4). Then for any $0 < \delta < 1$, with confidence $1 - \delta$ we have
Here $C_1$ and $C_2$ are constants independent of $m$, $\sigma$ or $\lambda$, which will be given explicitly in the proof.
From Theorem 1, we can see that decreasing the scale parameter $\sigma$ yields slower convergence rates, but it enhances the robustness of the regression model. We note that in the robustness literature, the scale parameter not only controls the robustness of the regression model associated with the loss $l_\sigma$ but also specifies its efficiency ([9]). Hence $\sigma$ balances the convergence rate against the robustness of the model. In the following theorems, we discuss the influence of the scale parameter $\sigma$ on the convergence rates.

Theorem 2. Assume the capacity condition (1.4) and the approximation condition (1.5). Then for suitable choices of $\lambda$ and $\sigma$, and any $0 < \delta < 1$, with confidence $1 - \delta$ we have

Theorem 3. Assume the capacity condition (1.4) with $q > 0$ and the approximation condition (1.5) with $0 < \beta < 1$. Take $\lambda = m^{\alpha - \frac{1}{1+q}}$ with $0 < \alpha < \frac{1}{1+q}$. Then for $\sigma \ge m^{\frac{1}{1+q}}$ and any $0 < \delta < 1$, with confidence $1 - \delta$ we have
where $C$ is a constant independent of $m$ or $\delta$.
Remark 1. When choosing a sufficiently large $\sigma$ and a suitable regularization parameter $\lambda$, Theorem 3 shows that the convergence rate of the regularized MCCR algorithm is at least of order $O(m^{\alpha - \frac{\beta}{1+q}})$ for an arbitrarily small $\alpha > 0$. When $\mathcal{H}_K$ contains $f_\rho$, $\beta = 1$ and the rate becomes $O(m^{\alpha - \frac{1}{1+q}})$. When $f_\rho \in \mathcal{H}$, taking $\sigma = O(m^{1/(3+3q)})$, the rate in Theorem A becomes $O(m^{-2/(3+3q)})$. When $q \to 0$, the rate in Theorem A becomes $O(m^{-2/3})$, while our result becomes $O(m^{\alpha - 1})$, which is better than the rate in Theorem A. In Theorem B, when $s \to 0$, the rate becomes $O(m^{-1})$, and our result is comparable to the rate in Theorem B.
The rest of this paper is organized as follows. In Section 2, we give the estimates and analysis for the error bounds and prove Theorem 1 and Theorem 2. In Section 3, we improve the error rate by an iteration method and give the proof of Theorem 3. In the last section, we give some conclusions and some ideas for future work.

2. Estimate and analysis for error bounds.
In this section we estimate and analyse some error bounds for the proof of our main results. We first show that the excess risk of $f$ associated with the $l_\sigma$ loss is a good approximation of the excess risk associated with the least squares loss when the scale parameter is large. In this paper, the excess risk of $f$ associated with the $l_\sigma$ loss refers to $\mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(f_\rho)$, while the excess risk of $f$ associated with the least squares loss refers to $\mathcal{E}(f) - \mathcal{E}(f_\rho)$.
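To see informally why this holds, expand the loss in the residual $t = y - f(x)$ for large $\sigma$ (a heuristic expansion, not part of the formal argument):
$$\sigma^2\big(1 - e^{-t^2/\sigma^2}\big) = \sigma^2\Big(\frac{t^2}{\sigma^2} - \frac{t^4}{2\sigma^4} + \cdots\Big) = t^2 - \frac{t^4}{2\sigma^2} + O(\sigma^{-4}),$$
so for bounded residuals the excess $l_\sigma$-risk differs from the least squares excess risk only by terms of order $\sigma^{-2}$.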

Lemma 1. For any essentially bounded measurable function $f$ on $\mathcal{X}$, we have
Using the reproducing property and the Schwarz inequality, we obtain the stated bound. This finishes the proof.
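A sketch of this standard step, under the RKHS definitions above: with $\kappa := \sup_{x \in \mathcal{X}}\sqrt{K(x,x)}$, the reproducing property and the Schwarz inequality give
$$|f(x)| = |\langle f, K_x\rangle_K| \le \|f\|_K\,\|K_x\|_K = \|f\|_K\sqrt{K(x,x)} \le \kappa\|f\|_K$$
for every $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$, so that $\|f\|_\infty \le \kappa\|f\|_K$.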
In the next two subsections, we estimate and analyse the sample errors $S_1$ and $S_2$.

2.2. Bounding the sample error $S_1$. We need the following one-sided Bernstein concentration inequality [2].

Lemma 3. Let $\xi$ be a random variable on a probability space $Z$ with variance $\sigma^2$, satisfying $|\xi - \mathbb{E}(\xi)| \le M_\xi$ almost surely for some constant $M_\xi$. Then for all $\varepsilon > 0$,
$$\mathrm{Prob}_{z \in Z^m}\left\{\frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mathbb{E}(\xi) \ge \varepsilon\right\} \le \exp\left\{-\frac{m\varepsilon^2}{2\big(\sigma^2 + \frac{1}{3}M_\xi\varepsilon\big)}\right\}.$$
We define a random variable $\xi(z)$, $z \in Z$, as

Proposition 1. For any $0 < \delta < 1$, with confidence $1 - \delta$,

Proof. We apply the one-sided Bernstein inequality to the random variable $\xi$ to bound the sample error $S_1$. Hence, we need to verify the conditions in Lemma 3.
Now applying Lemma 3 to the random variable $\xi$, for any $\varepsilon > 0$, with confidence at least
Choose ε * to be the unique positive solution of the quadratic equation
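A sketch of how such a quadratic is typically solved, assuming the one-sided Bernstein bound of Lemma 3 (here $\sigma^2$ denotes the variance of $\xi$, not the scale parameter): setting
$$\frac{m\varepsilon^2}{2\big(\sigma^2 + \frac{1}{3}M_\xi\varepsilon\big)} = \log\frac{1}{\delta}$$
and solving for the positive root gives
$$\varepsilon^* = \frac{M_\xi\log(1/\delta)}{3m} + \sqrt{\left(\frac{M_\xi\log(1/\delta)}{3m}\right)^2 + \frac{2\sigma^2\log(1/\delta)}{m}} \le \frac{2M_\xi\log(1/\delta)}{3m} + \sqrt{\frac{2\sigma^2\log(1/\delta)}{m}}.$$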

Then with confidence $1 - \delta$, the stated bounds hold. This proves the proposition.

2.3. Bounding the sample error $S_2$. For any $f \in B_R$, we define the random variable on $Z$
$$\xi_f(z) := \sigma^2\exp\{-(y - f_\rho(x))^2/\sigma^2\} - \sigma^2\exp\{-(y - f(x))^2/\sigma^2\} = l_\sigma(y, f(x)) - l_\sigma(y, f_\rho(x)).$$

Proposition 2. Assume the capacity condition (1.4). Let $R \ge 1$, $0 < \delta < 1$, $\frac{1}{m} \le \lambda \le 1$ and $\sigma \ge 1$ satisfy
Then there exists a subset $U_R$ of $Z^m$ with measure at most $\delta$ such that for every $z \in W(R)\setminus U_R$ we have
where $C_1$ and $C_2$ are constants independent of $R$, $m$, $\sigma$ or $\lambda$.
In order to prove Proposition 2, we need the following lemma, which we prove using the method from [5].
Following the proof of Proposition 1, we know that
Applying the one-sided Bernstein inequality to the group of random variables
$$\xi_j(z) = -\sigma^2\exp\{-(y - f_j(x))^2/\sigma^2\} + \sigma^2\exp\{-(y - f_\rho(x))^2/\sigma^2\}, \qquad j = 1, \ldots, J,$$
we obtain the following conclusion
Together with the fact that $\varepsilon < \mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(f_\rho) + 2\varepsilon$, we have
For any $f \in B_R$, if the following inequality holds
then, together with (2.8), (2.9) and (2.10), we have
Using the above estimates, we obtain the following conclusion
This proves the lemma. Now we turn to the proof of Proposition 2.
Proof of Proposition 2. First we use the capacity condition (1.4) to bound the covering number in (2.6). We get the inequality
(2.11)
The smallest positive solution of the above inequality can be bounded as
where $C_0$ is an explicit constant. Then the number $\varepsilon$ satisfies the restriction (2.5) in Lemma 4, and with confidence $1 - \frac{\delta}{2}$, there holds
By Proposition 1 and Lemma 2, we know that there exists a subset $U_R$ of $Z^m$ with measure at most $\delta$ such that for $z \in W(R)\setminus U_R$, we have
which yields
Applying Lemma 1 to $f = f_{z,\lambda}$, using $\lambda \ge 1/m$ and the condition (2.12) for $\sigma$, we see that for $z \in W(R)\setminus U_R$,
where the constant $C_2$ is given by
This proves the proposition.
Note that for any $z \in Z^m$, taking $f = 0$ in Definition 5 gives $\mathcal{E}^\sigma_z(f_{z,\lambda}) + \lambda\|f_{z,\lambda}\|_K^2 \le \mathcal{E}^\sigma_z(0) \le M^2$, so that $\|f_{z,\lambda}\|_K \le \frac{M}{\sqrt{\lambda}}$.

Proof of Theorem 1. From Proposition 2 with $R = \frac{M}{\sqrt{\lambda}}$, with confidence $1 - \delta$ there holds
This proves the theorem.
Proof of Theorem 2. From Theorem 1 and (1.5), with confidence 1 − δ, there holds This proves the theorem.
3. Improving the error rate by an iteration method. In this section we improve the error rate stated in Theorem 2. We will use the iteration technique introduced in [17].
Because L ≥