Learning rates for kernel regularized regression with a differentiable strongly convex loss

We consider learning rates of kernel regularized regression (KRR) based on reproducing kernel Hilbert spaces (RKHSs) and differentiable strongly convex losses, and provide some new strongly convex losses. We first show the robustness of the regularized solution with respect to the maximum mean discrepancy (MMD) and the Hutchinson metric, respectively, and, along this line, bound the learning rate of the KRR. We provide a capacity dependent learning rate and then give the learning rates for four concrete strongly convex losses. In particular, we provide the learning rates when the logarithmic complexity exponent of the hypothesis RKHS is arbitrarily small as well as sufficiently large.


1. Introduction. It is known that convex analysis is an effective tool for dealing with convex optimization problems. The learning problem, which arises in many disciplines such as Psychology, Animal Behavior, Economic Decision Making, Engineering, Computer Science and the study of human thought processes, is a kind of nonlinear optimization problem whose study requires knowledge of convex analysis, mathematical statistics, functional analysis and measure theory, and it has therefore been investigated by many researchers ([25,77]). The KRR model is one of the most important schemes in learning theory.
Let X be a compact subset of the q-dimensional Euclidean space R^q and Y be a nonempty closed subset contained in [−M, M] for a given M > 0. Then it is known that regression learning with the Support Vector Machine (SVM) aims to find a function f(x) between the input x ∈ X and the output y ∈ Y from a hypothesis space such that its values correspond to the conditional mean of y or closely related quantities. The relation between x and y can be modelled by a positive probability distribution (measure) ρ(x, y) = ρ(y|x) ρ_X(x) on Z := X × Y, where ρ(y|x) is the conditional probability of y for a given x and ρ_X(x) is the marginal probability of x.
Generally speaking, the distribution ρ is unknown and what one can access is a sample z := {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m ∈ Z^m drawn independently (i.i.d.) according to ρ. Given a sample z, the regression problem aims at finding a function f_z : X → R through the kernel regularized scheme
f^V_{z,λ} := arg min_{f ∈ H_K} { (1/m) ∑_{i=1}^m V(y_i − f(x_i)) + λ ‖f‖_K^2 },   (1.3)
where H_K is the hypothesis RKHS of a Mercer kernel K on X and 0 < λ < 1 is a given regularization parameter which is commonly used to overcome the ill-posedness. It is known that the most ideal regression function is
f^V_ρ := arg min_f ∫_Z V(y − f(x)) dρ,   (1.4)
where the minimum is taken over all measurable functions with respect to ρ. It is known that if V(t) = t^2, then ([25,26]) f^V_ρ = f_ρ, where f_ρ(x) = ∫_Y y dρ(y|x). For the needs of the statements we define the integral risk minimization scheme by
f^V_{ρ,λ} := arg min_{f ∈ H_K} { ∫_Z V(y − f(x)) dρ + λ ‖f‖_K^2 }.   (1.5)
In particular, we define f_{ρ,λ} := arg min_{f ∈ H_K} { ∫_Z (y − f(x))^2 dρ + λ ‖f‖_K^2 }, and we write functions of the representer form
f(x) = ∑_{i=1}^m α_i K(x_i, x)   (1.6)
for a real coefficient vector α = (α_1, α_2, · · · , α_m)^T.
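As an illustration of the empirical scheme (1.3) in the representer form (1.6), the following Python sketch (ours, not taken from the paper; the Gaussian kernel, the gradient-descent solver, the step size and the least square loss used in the demo are all assumptions made only for illustration) fits the coefficient vector α by gradient descent, taking the derivative V′ of the loss as an input.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, V_prime, lam=0.05, sigma=1.0, lr=0.005, n_iter=5000):
    """Minimize (1/m) sum_i V(y_i - f(x_i)) + lam * ||f||_K^2 over
    f(x) = sum_j alpha_j K(x_j, x) (representer form (1.6)) by gradient
    descent on alpha; V_prime is the derivative of the loss V."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.zeros(m)
    for _ in range(n_iter):
        r = y - K @ alpha                              # residuals y_i - f(x_i)
        # gradient of the empirical risk plus the penalty lam * alpha^T K alpha
        grad = -(K @ V_prime(r)) / m + 2.0 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha, K

# Demo with the least square loss V(t) = t^2, so V'(t) = 2t.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha, K = krr_fit(X, y, V_prime=lambda t: 2.0 * t)
print("training RMSE:", np.sqrt(np.mean((y - K @ alpha) ** 2)))
```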
The concept of stability is often used to describe the dependence of the optimizer on the parameters qualitatively ( [12]), for example, for the optimization problem V (u) = arg min x∈X ϕ(x, u), u ∈ D, where u is a parameter varying in a given domain D.
Discussing the continuity of V with respect to u on D belongs to the scope of stability. By robustness we mean a quantitative description of the dependence of V on the parameter u.
In the field of kernel regularized learning, the robustness of f^V_{ρ,λ} with respect to the distribution ρ has been systematically studied by I. Steinwart and A. Christmann. They show that if V is a differentiable convex function, then the bound (1.7) holds (Theorem 5.9 of [72]), where γ is another probability distribution on Z and f^V_{γ,λ} is the corresponding optimization solution defined in the same way as f^V_{ρ,λ} in (1.5) ([18,19,21,20,32,39,23,28,45,69]).
Let P_Z be the set of all probability measures on Z and d(·, ·) be a metric on P_Z. Then we expect to develop a robust method so that d(ρ, γ) can be used to describe the right side of (1.7) quantitatively.
Let µ be a finite signed measure on a Banach space E. We define its norm of total variation as
‖µ‖_{TV} := sup { ∑_{i=1}^n |µ(E_i)| : E_1, E_2, · · · , E_n is a partition of E }.
Theorem 10.15 of [72] shows a bound of this type in which C(V, λ) > 0 is a constant depending upon V and λ; see also [17]. If V is a Lipschitz loss, then (Theorem 10.27 of [72] or [18,19]) a bound holds with C_{γ,λ} = λ^{−1} ‖K‖_∞ |V|_1 ‖γ − ρ‖_{P_Z}, where |V|_1 = sup_{x∈R} |V′(x)| is the Lipschitz constant of V. When V is a Lipschitz loss, [22] provides a bound for ‖f^V_{ρ,λ} − f^V_{γ,λ}‖_∞, where C_{(V,λ)} is a constant depending upon λ and V. Similar estimates are given in [24] for kernel regularized pairwise regression. The Lipschitz assumption is a strong condition since even V(t) = t^2 is not a Lipschitz function on R. An interesting problem is: if the assumption that V(t) is a Lipschitz loss is removed or weakened, what can we say about the estimate? This is the first motivation for writing this paper. Another research topic in learning theory is to bound the error between f^V_{z,λ} and f^V_ρ in probability. Many mathematicians have devoted themselves to this field and many approaches have been developed, see e.g., the convex analysis approach ([19,72]), the capacity approach ([15,25,26,27,78]), the integral operator approach ([44,53,65,74]), the spectral learning approach ([2,6,29]) and the optimization approach ([5,86]). To investigate the optimal learning rates for algorithm (1.3), many researchers choose the quantile loss ([21,23,70,71,84]) since it has the Lipschitz property. It is known that the least square loss V(t) = t^2 has many nice properties, and the learning approaches based on it have been studied by many researchers ([9,10,14,35,33,36,42,43,59,73,80,83]). When V is a general convex loss, the error has also been studied in many papers ([19,20,46,75,76]).
There are three issues that deserve our attention. (i) It is known that for V(t) = t^2 the errors are described by the norm ‖·‖_{L^2(ρ_X)} due to the famous comparison equality ([25]). Since many convex losses have no such comparison equalities, many papers describe the error in expectation. Recently, some mathematicians have paid attention to building comparison inequalities of the type (1.12). For example, some comparison inequalities for the pinball loss are established under certain realistic assumptions on ρ ([35,70,84,85]). It will be of interest if we can establish a comparison inequality similar to (1.12) for strongly convex losses. This is the second motivation for writing this paper. (ii) Although the error bound for a convex loss has been estimated in many papers, we find that the typical examples of losses are in fact fewer than we would hope. It is known that losses satisfying the Lipschitz condition may provide nice learning rates (e.g., the pinball loss in [21]). Are there other losses which do not satisfy the Lipschitz property but can attain the same learning rate as that of V(t) = t^2? It is known that strongly convex losses have many advantages; in particular, the Bregman distance has been used for clustering ([4]). Whether or not the kernel regularized regression associated with them can attain the same learning rate as that of V(t) = t^2 is unclear. This is the third motivation for writing this paper. (iii) We notice that, to provide the learning rate for a general convex loss, the variance-expectation bound condition (1.13) is needed for the loss ([75,76]). This assumption always holds if V is a strongly convex loss (p. 338 of [90]), so in that case (1.13) is not an extra assumption. To provide learning rates without the variance-expectation bound condition (1.13) is an interesting topic. This forms the fourth motivation for writing this paper.
The contributions of this manuscript are threefold. First, we provide two kinds of explicit upper bounds for the error ‖f^V_{ρ,λ} − f^V_{γ,λ}‖_{L^2(ρ_X)} in the MMD IPM (maximum mean discrepancy integral probability metric) and in the Hutchinson metric, respectively, when V is a differentiable convex loss and its conjugate loss V* is a strongly convex loss. Second, we provide three new differentiable strongly convex losses; some of them can attain the same learning rate as that of V(t) = t^2 in the setting of the present paper. Third, we find that if V is a convex loss and its conjugate loss V* is a strongly convex loss, then V′ is an important quantity in describing the learning rate, which may lead to new metrics, e.g., the metrics ln^2 |x/y| and |x|/√(1 + x^2) on R. In a word, we find in this manuscript that the strong convexity of the conjugate loss V* of a convex loss V contributes more to the learning rate than the strong convexity of V itself.
The paper is organized as follows. In Section 2 we shall provide some notions and concepts of convex analysis; three new convex losses and their conjugate losses will be provided. In Section 3 we shall provide some robustness bounds in terms of the MMD IPM and the Hutchinson metric, respectively. The upper bounds are given explicitly with the help of the K-functional defined in learning theory.
In Section 4 we shall show the learning rates for some strongly convex losses. We first give a general error estimate in the case that V is a differentiable convex loss and its conjugate loss V* is a strongly convex loss with convex modulus 1/c, c > 0, and then provide learning rates for four concrete differentiable convex losses. In Section 5 we shall give some discussions of the obtained results. Some lemmas used in proving the main results are given in Subsection 6.1. All the proofs are given in Subsection 6.2.
Throughout the paper, we assume that ρ has a finite second moment, i.e., ∫_Z y^2 dρ < +∞. We say A = O(B) if there exists a constant C > 0 such that A ≤ CB. We say A ∼ B if both A = O(B) and B = O(A). For any a = (a_1, a_2, · · · , a_q)^T, b = (b_1, b_2, · · · , b_q)^T ∈ R^q, we define in this paper ⟨a, b⟩_2 := ∑_{i=1}^q a_i b_i.

2. Notations and concepts. Let P_X be the set of marginal probability distributions on X. Then we can define on P_X a metric with characteristic kernels. We call a bounded measurable positive definite kernel K a characteristic kernel if for ρ_X, γ_X ∈ P_X the equality ‖∫_X K_x(·) dρ_X − ∫_X K_x(·) dγ_X‖_K = 0 implies ρ_X = γ_X. This is not only equivalent to the fact that the kernel mean embedding is injective, but also equivalent to the universality of the kernel ([47,49,67,68]), where we say a Mercer kernel K_x(y) = K(x, y) is universal if for any given subset W ⊂ X, any positive number ε and any function f ∈ C(W) there is a function g of the form (1.6), i.e., g(x) = ∑_{i=1}^m α_i K(x_i, x), such that ‖f − g‖_{C(W)} ≤ ε. One can find such kinds of kernels in [55,57,60]. Define the maximum mean discrepancy (MMD) integral probability metric (IPM) between ρ_X and γ_X as
d_K(ρ_X, γ_X) := ‖∫_X K_x(·) dρ_X(x) − ∫_X K_x(·) dγ_X(x)‖_K.

Then, by Section 3.5 of [48] or Theorem 1 of [68], we know that d_K(ρ_X, γ_X) = sup_{‖f‖_K ≤ 1} |∫_X f dρ_X − ∫_X f dγ_X|. To bound ‖f^V_{ρ,λ} − f^V_{γ,λ}‖_K in terms of d_K(ρ_X, γ_X) is a topic that deserves consideration.
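The MMD d_K(ρ_X, γ_X) can be estimated from samples through the reproducing property. The following Python sketch (ours, not from the paper; the Gaussian kernel, the sample sizes and the biased V-statistic estimator are assumptions made for illustration) computes a plug-in estimate by replacing ρ_X and γ_X with empirical measures.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(X, Y, sigma=1.0):
    """Biased empirical estimate of d_K(rho_X, gamma_X) for samples X, Y:
    || mean_i K_{x_i} - mean_j K_{y_j} ||_K, expanded via the reproducing
    property into mean(Kxx) + mean(Kyy) - 2*mean(Kxy)."""
    Kxx = gaussian_kernel(X, X, sigma).mean()
    Kyy = gaussian_kernel(Y, Y, sigma).mean()
    Kxy = gaussian_kernel(X, Y, sigma).mean()
    return np.sqrt(max(Kxx + Kyy - 2.0 * Kxy, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.5, 1.0, size=(500, 1))   # shifted distribution
print("empirical MMD:", mmd(X, Y))
```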
Let B be a metric space with metric d(x, y). We denote by C(B) the set of all continuous functions f : B → R. We say f ∈ C(B) has the Lipschitz property if |f(x) − f(y)| ≤ C(f) d(x, y) for all x, y ∈ B, and we call C(f) a Lipschitz constant; in particular we denote by Lip(f) the smallest such constant. Define the Hutchinson metric as
D_H(ρ, γ) := sup { |∫_B f dρ − ∫_B f dγ| : f ∈ C(B), Lip(f) ≤ 1 }.
We call a function f : X → R a strongly convex function on X if for any x, x′ ∈ X and any λ ∈ [0, 1] there holds
f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′) − (c/2) λ(1 − λ) ‖x − x′‖^2,
where c > 0 is the convex modulus. If c = 0, we call it a convex function.
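As a quick numerical sanity check of the strong convexity definition above, the following sketch (ours; the test functions, their moduli as claimed in this section, and the random sampling are the assumptions) verifies the defining inequality at random triples (x, x′, λ).

```python
import numpy as np

def strong_convexity_gap(f, c, x, xp, lam):
    # Left minus right side of
    #   f(lam*x + (1-lam)*xp) <= lam*f(x) + (1-lam)*f(xp)
    #                            - (c/2)*lam*(1-lam)*(x - xp)^2
    # A nonpositive value means the inequality holds at this triple.
    lhs = f(lam * x + (1 - lam) * xp)
    rhs = lam * f(x) + (1 - lam) * f(xp) - 0.5 * c * lam * (1 - lam) * (x - xp) ** 2
    return lhs - rhs

f_sq = lambda t: t ** 2                               # strongly convex, modulus 2
V2s  = lambda u: np.exp(np.abs(u)) - np.abs(u) - 1    # claimed modulus 1 in Section 2

rng = np.random.default_rng(3)
for f, c in [(f_sq, 2.0), (V2s, 1.0)]:
    gaps = [strong_convexity_gap(f, c, *rng.uniform(-5, 5, 2), rng.uniform())
            for _ in range(10000)]
    print(max(gaps))   # should not exceed 0 beyond floating-point rounding
```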
By Proposition 10.6 and Proposition 17.10 of [7] we know that if f is a differentiable convex loss on X, then it is a strongly convex function with convex modulus c if and only if for any x, x′ ∈ X,
f(x) ≥ f(x′) + ⟨∇f(x′), x − x′⟩_2 + (c/2) ‖x − x′‖^2.
For a function f : R^q → R ∪ {+∞} not identically +∞ the conjugate function f* is defined by
f*(y) := sup_{x ∈ dom f} { ⟨x, y⟩_2 − f(x) },
where dom f = {x ∈ R^q : −∞ < f(x) < +∞} and ⟨·, ·⟩_2 is the inner product in R^q.
It is known that f is convex and lower semicontinuous if, and only if, (f*)* = f. By the N-function theory of Orlicz spaces (Chapter 1 of [52]) we know that if V : R → R is an even convex function, then its conjugate function V* : R → R is given by V*(u) = sup_{t ≥ 0} { |u| t − V(t) } and is again an even convex function. From now on we shall denote by V* the conjugate function of a function V.
It is known that for M(u) = |u|^p / p, p > 1, we have M*(u) = |u|^{p′} / p′ with p′ = p/(p − 1). We show in Lemma 6.9 that for the convex loss V_1 the conjugate loss is V*_1(t) = e^{|t|} + e^{−|t|} − 2, which is a strongly convex loss on R with convex modulus 2.
By Example 11 in Chapter 1 of [52] we know that the convex loss V_2(t) = (1 + |t|) ln(1 + |t|) − |t| has the conjugate loss V*_2(u) = e^{|u|} − |u| − 1, which is a strongly convex loss on R with convex modulus 1.
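This conjugate pair can be checked numerically with a discretized Legendre transform. The following sketch (ours; the finite grid, the test points and the closed forms as written in this section are the assumptions) compares sup_t { u t − V_2(t) } with e^{|u|} − |u| − 1.

```python
import numpy as np

def legendre(V, u, t_grid):
    # Numerical conjugate V*(u) = sup_t { u*t - V(t) } over a finite grid.
    return np.max(u * t_grid - V(t_grid))

V2      = lambda t: (1 + np.abs(t)) * np.log(1 + np.abs(t)) - np.abs(t)
V2_star = lambda u: np.exp(np.abs(u)) - np.abs(u) - 1

t_grid = np.linspace(-50, 50, 200001)
for u in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    num = legendre(V2, u, t_grid)
    print(f"u={u:+.1f}  numeric={num:.6f}  closed form={V2_star(u):.6f}")
```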
The convex loss ([66]) is a strongly convex loss with convex modulus 1 (see Lemma 6.10). By Theorem 4.2.1 of [41] we know that if f* is a differentiable strongly convex function with convex modulus 1/c on R^q, then ∇f is Lipschitzian with constant c on R^q:
‖∇f(x) − ∇f(x′)‖ ≤ c ‖x − x′‖ for all (x, x′) ∈ R^q × R^q.
It follows, in this case, that two further inequalities hold (from page 120 of [41] and Lemma 1.30 of [50], respectively). Inequality (2.7) is more useful than inequality (2.1) in the setting of this paper. It will lead to a new metric (1.15) and a comparison inequality (5.6).
It is known that if V is a differentiable even convex function on (−∞, +∞), then V′ is an odd and increasing function on (−∞, +∞) and |V′(t)| = V′(t) for t ≥ 0.
The K-functional is a key quantity in learning theory ([16,27,30,55,58]). It shows the approximation ability of H_K with respect to L^2(ρ_X) and is influenced by the loss V(t); in particular, when V(t) = t^2 it takes its classical form. To show an explicit learning rate for the algorithm (1.3), one often uses an assumption ([63,61,64,73,75,76]): there are 0 < β ≤ 1 and c_β > e such that (2.9) holds. Since the corresponding quantity is unbounded unless β = 1, we assume in this paper, without loss of generality, that it is bounded by M, where M is the given constant in the definitions of Y and Z.
Let f^V_{ρ,λ} and f^V_{γ,λ} be the solutions of algorithm (1.5) with respect to ρ and γ respectively. If V is a differentiable convex loss, its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, and V′ is continuous on R, then the bound (3.1) holds. By (3.1) we know that V′ is a key factor in quantitatively describing the robustness.
To show the robustness of f^V_{ρ,λ} with respect to ρ in the metric D_H(ρ, γ), we borrow the concept of separating kernels ([31]). We call a Mercer kernel K : X × X → R a separating kernel if for all x, y ∈ X with x ≠ y we have K_x ≠ K_y. By Proposition 1 of [31] we know that if K is a separating kernel, then d(x, y) := ‖K_x − K_y‖_K defines a metric on X. We use this distance to define the Hutchinson metric D_H(ρ, γ) on P_Z and use it to describe the robustness of f^V_{ρ,λ}.
Theorem 3.1. Let K : X × X → R be a separating kernel, and let f^V_{ρ,λ} and f^V_{γ,λ} be the solutions of algorithm (1.5) with respect to ρ and γ respectively. If V is a differentiable convex loss, its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, and V′ is continuous on R, then (3.3) holds. We give several estimates under the condition that K : X × X → R is a separating kernel.
• When V(t) = t^2 / 2, we have the corresponding estimate. In this paper, we denote by f^{V_3}_ρ the solution of the associated equation. By (3.4) and (3.10) we know that if V is bounded, then the robustness is controlled by the marginal distribution.
4. Learning rates. To give a capacity dependent learning rate for algorithm (1.3), we borrow the concept of covering number.
Let S be a metric space and η > 0. Then the covering number N(S, η) is defined to be the minimal positive integer l such that there exist l disks in S with radius η covering S.
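To make the covering number concrete, the following greedy sketch (ours; the point cloud, the radius values and the greedy center selection are assumptions for illustration) returns an upper bound on N(S, η) for a finite sample S in the unit square, where the count grows roughly like (1/η)^2 and hence ln N(S, η) grows like 2 ln(1/η).

```python
import numpy as np

def covering_number(S, eta):
    """Greedy upper bound on N(S, eta): number of balls of radius eta,
    centered at points of S, needed to cover the finite point set S."""
    remaining = list(range(len(S)))
    count = 0
    while remaining:
        c = remaining[0]
        remaining = [i for i in remaining
                     if np.linalg.norm(S[i] - S[c]) > eta]
        count += 1
    return count

rng = np.random.default_rng(2)
S = rng.uniform(0, 1, size=(2000, 2))          # sample points in the unit square
for eta in [0.5, 0.25, 0.125, 0.0625]:
    print(eta, covering_number(S, eta))
```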
We say a compact subset E of a metric space (H, ‖·‖_H) has logarithmic complexity exponent s ≥ 0 if there is a constant c_s > 0 such that ln N(E, η) ≤ c_s (1/η)^s for all η > 0. In particular, we say H_K has logarithmic complexity with exponent s if this holds for the closed ball of radius r centered at the origin, B_r := {f ∈ H_K : ‖f‖_K ≤ r}, i.e.,
ln N(B_r, η) ≤ c_s (r/η)^s.   (4.1)
Let X ⊂ R^q be a compact set; for a positive integer s we denote by C^{(s)}(X) the space of functions on X that are s-times differentiable and whose s-th partial derivatives are continuous. It is known by Theorem 5.1 of [27] that if K ∈ C^{(s)}(X × X), then H_K has logarithmic complexity with exponent 2q/s. It is proved in [87] that for a C^∞(X × X) kernel (such as the Gaussians), (4.1) is valid for an arbitrarily small s > 0.
Let C([a, b]^q, B, c) denote the class of real-valued convex functions defined on [a, b]^q that are uniformly bounded in absolute value by B and uniformly Lipschitz with constant c. Then by [37] or Theorem 6 of [13] we know the covering number estimate (4.2). Also, by Theorem 5.8 of [27] we know that if X is a closed subset of R^q with piecewise smooth boundary and K ∈ Lip*[α, C(X × X)], then (4.3) holds. The estimates (4.2) and (4.3) show that the exponent s depends upon the dimension q; it increases to +∞ if α → 0^+ or q → +∞. These facts encourage us to consider the learning rate for the cases K ∈ C^∞(X × X) and K ∈ Lip*[α, C(X × X)] with α → 0^+. We first give a general learning rate for algorithm (1.3).
Proposition 4.1. Let f^V_{z,λ} and f^V_ρ be the solutions of algorithms (1.3) and (1.4) respectively. Let V be a differentiable convex loss whose conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c and whose derivative V′ is continuous on R. Assume H_K has logarithmic complexity with exponent s > 0, i.e., (4.1) holds; then the general estimate holds provided (4.5) holds.
Moreover, we give the following theorems.
Similar results to those on the right-hand sides of (4.12)–(4.13) also hold for V_3(t).
It can be seen that the loss V_3 provides faster learning rates than the least square loss in the setting of this paper.
5. Further discussions. We now give some further analysis for the above results.
• On the robustness. In this paper, by robustness we mean the quantitative description of the dependence of f^V_{ρ,λ} on P_Z, which is different from the robustness in kernel based robust regression ([38,79]), where robustness describes how the parameters in the windowing loss influence the learning rates.
• Comparison of the learning rates with the related literature. For V(t) = t^2, Theorem 2 of [82] shows that if K satisfies (4.1) and (2.9) holds, then the rate (5.1) is attained. It is easy to see that (4.10) is nicer than (5.1) for sufficiently large s. Also, if K satisfies (4.1) and (2.9) holds, then [83] shows the rate (5.2). Moreover, [35] shows that for f_ρ ∈ B^α_{2,∞} (whose definition is in the appendix) a learning rate of the form (5.4) holds. Then the right side of (4.9) is faster than (5.4) for sufficiently small s > 0 and q > 6, and the right side of (4.10) is faster than (5.2) for sufficiently large s. The above comparisons show that the error estimates in this paper have certain advantages over some in the literature. These advantages come from strong convexity: because of the strong convexity of V*, the learning rate is contributed by both E and G (see the proof of Theorem 4.1) and V′ has the Lipschitz property.
• On the comparison inequality. By (iv) of Lemma 6.1 we know that if V is a strongly convex loss with convex modulus c > 0, then the comparison inequality (5.5) holds. Also, if V is a differentiable convex loss and its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, then we have the comparison inequality (5.6) with respect to V′. Although both inequalities (5.5) and (5.6) are comparison theorems for a strongly convex loss, they are essentially different. Inequality (5.6) is more practical than (5.5): when V* is a differentiable strongly convex loss, V′ usually satisfies a Lipschitz condition and grows slowly, whereas (V*)′ itself often has a high rate of increase, e.g. V*(t) = e^{t^2/2}, and the kernel regularized algorithm associated with such a loss usually has a very slow convergence rate ([56,62]).
• On the variance-expectation bound condition (1.13). We point out here that the variance-expectation bound condition (1.13) is satisfied by many convex losses (Chapter 10.4 of [27]); in particular, it holds when V(t) = t^2. The variance-expectation bound condition (1.13) is an outcome of the capacity approach, due to the error decomposition; in the convex analysis approach and the integral operator approach it is not necessary. It can be applied to a large class of convex losses, but for some concrete losses it cannot provide the optimal learning rate. For example, if V satisfies (1.13), (4.1) and (2.9), then [75] shows the rate (5.8). If V(t) = t^2, then (5.8) becomes (5.9) (since α = 1). The right side of (5.9) is nice for very small s > 0, but it is very weak for sufficiently large s and is not as nice as that of (4.10).
• On the K-functional. By the inequality (6.7) in Lemma 6.1 we have the bound (5.10). The decay rate of the right side of (5.10) may be described by a modulus of smoothness ([58]).
• On the characteristic kernels. It is known that the characteristic kernel theory is based on the universality of Mercer kernels. The universality of deep convolution kernels (neural networks) has been studied by many mathematicians ([51,81,88,89]). The kernel regularized learning approach for multi-layer convolution kernels has been investigated in [8,11], among others. It would be an interesting topic to extend the characteristic kernel theory to deep convolution kernels (neural networks).

6. Proofs. To prove the theorems in Section 3 and Section 4, we need some lemmas.

6.1. Lemmas.

Lemma 6.1. Let V be a differentiable convex loss with V′ continuous on R, and let f^V_{ρ,λ} be the optimal solution of scheme (1.5). Then (i) there holds (6.1); (v) if V is a differentiable convex loss and its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, then the corresponding bound holds; (vi) if V is a differentiable convex loss and its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, then (6.7) holds.

Proof. (i) By inequality (1.2) and the definition of f^V_{ρ,λ} we have (6.1). (ii) Since V is a differentiable convex loss, we have by (ii) of Proposition 7.1 in the Appendix the stated identity. It follows, using (6.9) and (6.8) and the definition of the Gâteaux derivative, that (6.2) holds. When g ∈ H_K we have the corresponding identity, so (6.4) holds.
(iv) By the definition of f^V_ρ we have the first relation. On the other hand, by the property of the Gâteaux derivative we have, for z = (x, y) ∈ Z, the second relation. It follows that the desired inequality holds, where in the last derivation we have used (6.13).
(v) Since V* is a differentiable strongly convex loss with convex modulus 1/c, we have (6.14). It follows that the desired bound holds, where we have used (6.13) as well.
Lemma 6.2. Let ρ, γ be distributions on Z = X × Y, and let f^V_{ρ,λ} and f^V_{γ,λ} be the solutions of scheme (1.5) for ρ and γ respectively. If V is a differentiable convex loss and its conjugate loss V* is a differentiable strongly convex loss with convex modulus 1/c, then (6.15) holds, and (1.7) holds as well.
(1.7) and (6.15) show the robustness of the solutions with respect to the distributions.
Proof. Since V* is a strongly convex function with convex modulus 1/c, we have the corresponding inequality. By (iii) of Proposition 7.1 in the Appendix we have the stated identity. Therefore, the bound follows, where in the last step we have used (6.4).
The definitions of f^V_{ρ,λ} and f^V_{γ,λ} lead to (6.16), where in the last derivation we have used Cauchy's inequality. We thus have (1.7) and (6.15) by (6.16).
Lemma 6.3. Let K be a dρ_X-measurable and bounded kernel on X × X, H_K be its reproducing kernel Hilbert space, and M be a bounded function on X. Then (6.17) holds for any f ∈ H_K.
Proof. (6.17) is a generalization of Lemma 26 of [67]. Define a linear functional by T_{ρ_X}(f) := ∫_X M(x) f(x) dρ_X(x), f ∈ H_K. Then T_{ρ_X}(f) is a bounded linear functional on H_K. By the Riesz representation theorem there exists a unique λ_{ρ_X} ∈ H_K such that T_{ρ_X}(f) = ⟨f, λ_{ρ_X}⟩_K for all f ∈ H_K. In particular, for K_x(·) ∈ H_K we have T_{ρ_X}(K_x) = ⟨K_x, λ_{ρ_X}⟩_K. By the reproducing property we have λ_{ρ_X}(x) = ∫_X M(u) K_x(u) dρ_X(u), and (6.17) follows.
Lemma 6.4. Let b be a bounded function on Z. Then (6.18) holds.
Proof. Replacing M in (6.17) by ∫_Y b(x, y) dρ(y|x) and ∫_Y b(x, y) dγ(y|x) respectively, we obtain (6.18).
Lemma 6.5 (Proposition 3.13 of [27]). Let F be a family of functions from a probability space Z to R and d(·, ·) a metric on F. Let U ⊂ Z be of full measure and B, L > 0 be constants such that (i) |ξ(z)| ≤ B for all ξ ∈ F and all z ∈ U, and (ii) |ξ(z) − ξ′(z)| ≤ L d(ξ, ξ′) for all ξ, ξ′ ∈ F and all z ∈ U. Then for all ε > 0, (6.19) holds.
Lemma 6.6. Let F be the family of functions defined as in Lemma 6.5, and let V be a convex loss on R with V′ c-Lipschitz continuous. Then for the class V_F consisting of the loss functions V(y − η(·)), η ∈ F, the bound (6.20) holds.
Proof. Since |V(y − η(z))| ≤ V(2B), we have (6.21) by (6.19). By (2.6) we know that for η′, η′′ ∈ F the corresponding Lipschitz estimate holds.
Let ε_i and δ_i, i = 1, 2, be real numbers with 0 < ε_i < 1 and 0 < δ_i < 1.
Lemma 6.9. The function V_1 is an even convex loss on R, V_1′ is a Lipschitz function with Lipschitz constant 1/2, and its conjugate loss is V*_1(t) = e^{|t|} + e^{−|t|} − 2, which is a strongly convex loss with convex modulus 2.
Lemma 6.10. The function V_3 is an even convex loss on R, V_3 is Lipschitz with Lipschitz constant 2, and its conjugate function is a strongly convex loss with convex modulus 1.
Proof. By the definition we have the stated expressions.

Lemma 6.11. Let ξ be a random variable taking values in a real separable Hilbert space H on a probability space (Ω, F, P). Assume that there are two positive constants L and σ such that the Bernstein-type moment condition holds.
Proof of Theorem 3.1, i.e., proof of (3.3). Since V′ is a Lipschitz function with Lipschitz constant c > 0, we have for any (x, y), (x′, y′) ∈ Z and a given h ∈ H_K satisfying ‖h‖_K ≤ 1 the required estimate, where we have used (6.1) and the inequality |⟨f, K_x⟩_K| ≤ ‖f‖_K √(K(x, x)). In the same way, we obtain (3.3).
Proof of (3.4). Replacing sgn(V′(y − f^V_{γ,λ}(x))) sgn(h(x)) h(x) with h(x), we have the corresponding bound, where we have used the fact stated above. In the same way, replacing the sign factors accordingly, we obtain the estimate.
Proof of (3.5).