Convergence of online pairwise regression learning with quadratic loss

Recent investigations on the error analysis of kernel regularized pairwise learning have initiated the theoretical study of pairwise reproducing kernel Hilbert spaces (PRKHSs). In the present paper, we provide a method of constructing PRKHSs with classical Jacobi orthogonal polynomials. The performance of kernel regularized online pairwise regression learning algorithms based on a quadratic loss function is investigated. Applying convex analysis and Rademacher complexity techniques, explicit bounds for the generalization error are provided. It is shown that the convergence rate can be greatly improved by adjusting the scale parameter in the loss function.


1. Introduction. In recent years, online learning algorithms have attracted the attention of many researchers in statistical learning theory (see, e.g., [21,58,71,73,74] and references therein). In the present paper, we investigate the convergence of a kernel-based regularized online pairwise learning algorithm associated with the quadratic loss.
1.1. Online learning algorithms. Let $X$ be a given compact set in the $d$-dimensional Euclidean space $\mathbb{R}^d$ and $Y\subset\mathbb{R}$. Let $\rho$ be a fixed but unknown probability distribution on $Z=X\times Y$ with marginal distribution $\rho_X$ on $X$ and conditional distribution $\rho(\cdot\,|\,x)$ at $x\in X$. We denote by $\{z_t=(x_t,y_t)\}_{t=1}^T$ the sample drawn i.i.d. (independently and identically distributed) according to the distribution $\rho$. The aim of regression learning is to find a predictor $f: X\to\mathbb{R}$ from a hypothesis space such that $f(x)$ is a "good" approximation of $y$. Let $V(r):\mathbb{R}\to\mathbb{R}_+$ be an even loss function; it is well known that its derivative $V'(r)$ is then an odd function. The quality of the predictor $f$ is measured by the generalization error
$$\mathcal{E}(f)=\int_Z V\big(f(x)-y\big)\,d\rho(x,y). \qquad (1.1)$$
In the theory of kernel regularized learning, RKHSs are often chosen as the hypothesis spaces. The batch algorithm based on an RKHS $\mathcal{H}_K$ is defined as ([20])
$$f_{\mathbf{z},\lambda}=\arg\min_{f\in\mathcal{H}_K}\Big\{\frac{1}{T}\sum_{t=1}^{T}V\big(f(x_t)-y_t\big)+\frac{\lambda}{2}\|f\|_K^2\Big\}, \qquad (1.2)$$
with the data-free analogue
$$f_{\lambda}=\arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}(f)+\frac{\lambda}{2}\|f\|_K^2\Big\},$$
where $\lambda>0$ is a regularization parameter.
In contrast to batch learning, which processes the whole sample at once, online learning processes the samples one by one and updates the output in time. For any $f_t\in\mathcal{H}_K$, by the results on functional gradients (see [42,50]), the gradient of $\mathcal{E}(f)+\frac{\lambda}{2}\|f\|_K^2$ at the point $f_t$ with respect to $f$ is
$$\nabla\Big(\mathcal{E}(f)+\frac{\lambda}{2}\|f\|_K^2\Big)\Big|_{f=f_t}=\int_Z V'\big(f_t(x)-y\big)K_x(\cdot)\,d\rho(x,y)+\lambda f_t.$$
Using the classical gradient descent method [9], we obtain a sequence $\{g_t: g_t\in\mathcal{H}_K\}$, which provides an approximation to $f_\lambda$, by the iterative formula
$$g_1=0,\qquad g_{t+1}=g_t-\eta_t\Big[\int_Z V'\big(g_t(x)-y\big)K_x(\cdot)\,d\rho(x,y)+\lambda g_t\Big]. \qquad (1.4)$$
However, since the distribution $\rho$ is unknown, the algorithm (1.4) cannot be implemented directly. Based on the stochastic gradient descent method (see, e.g., [23,29,48]), the integral in (1.4) is replaced by the value $V'\big(f_t(x_t)-y_t\big)K_{x_t}(\cdot)$, and the kernel-based online learning algorithm is given by (see, e.g., [58])
$$f_1=0,\qquad f_{t+1}=f_t-\eta_t\Big[V'\big(f_t(x_t)-y_t\big)K_{x_t}(\cdot)+\lambda f_t\Big],\quad t=1,\ldots,T, \qquad (1.5)$$
where $\eta_t$ is called the step size and the sequence $\{f_t: t=1,\cdots,T+1\}$ is the learning sequence. The convergence rate of the online learning algorithm (1.5) has been extensively studied in the literature (see, e.g., [38,52,58,69,71,75]). The results in [38] show that, under mild conditions, the regularized online learning algorithm (1.5) converges comparably fast to the batch learning algorithm (1.2).
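To make the update (1.5) concrete, the following minimal Python sketch runs it for the least squares loss $V(r)=r^2/2$ (so $V'(r)=r$); the Gaussian kernel, the step size schedule $\eta_t=t^{-\theta}/\lambda$ and the synthetic data are illustrative assumptions, not choices prescribed in this paper. The iterate is stored through its kernel expansion coefficients.

```python
import numpy as np

def gaussian_kernel(x, xp, s=0.5):
    """Pointwise Mercer kernel K(x, x') = exp(-|x - x'|^2 / (2 s^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * s ** 2))

def online_kernel_regression(X, y, lam=0.1, theta=0.5):
    """Update (1.5) with V'(r) = r (least squares) and eta_t = t^{-theta} / lam.
    The iterate f_t = sum_j c[j] K_{x_j} is stored via its coefficients c:
    the update rescales all existing coefficients by (1 - eta_t * lam) and
    appends the new coefficient -eta_t * V'(f_t(x_t) - y_t)."""
    T = len(y)
    c = np.zeros(T)                      # f_1 = 0
    for t in range(T):
        eta = (t + 1.0) ** (-theta) / lam
        f_xt = sum(c[j] * gaussian_kernel(X[j], X[t]) for j in range(t))
        c[:t] *= (1.0 - eta * lam)       # the -eta_t * lam * f_t term
        c[t] = -eta * (f_xt - y[t])      # the -eta_t * V'(f_t(x_t) - y_t) K_{x_t} term
    return c

# toy usage on synthetic one-dimensional data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
coeffs = online_kernel_regression(X, y)
```

The rescaling of the stored coefficients corresponds to the $-\eta_t\lambda f_t$ part of the update, and the appended coefficient to the stochastic loss-gradient part.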
1.2. The kernel regularized online pairwise learning algorithm. Recently, kernel-based regularized pairwise learning algorithms have been considered in [16,17]; their background comes from problems such as ranking (see, e.g., [1,12,13,18,35,47,76]) and similarity learning (see, e.g., [11,38]). An important innovation of pairwise learning over the usual online learning is the use of a pairwise reproducing kernel Hilbert space (PRKHS) $\mathcal{H}_K$, which is reproduced by a unique symmetric and positive definite continuous function $K: X^2\times X^2\to\mathbb{R}$, called a pairwise reproducing kernel or pairwise Mercer kernel (see, e.g., [10,13,17,27,45]).
The inducing function $\delta: Y\times Y\to\mathbb{R}$, chosen according to the learning task, is a new ingredient that makes pairwise learning essentially different from pointwise learning. For any hypothesis $f\in\mathcal{H}_K$, if the inducing function $\delta(y,y')$ is a symmetric function, we naturally hope that $f$ is a symmetric function, i.e. $f(x,x')=f(x',x)$ for all $x,x'\in X$, and if the inducing function $\delta(y,y')$ is an anti-symmetric function, for example $\delta(y,y')=y-y'$, we hope that $f$ is an anti-symmetric function, i.e. $f(x,x')=-f(x',x)$ (see [10,13,18]). So it is valuable to provide a way of constructing PRKHSs whose functions have the symmetric or anti-symmetric property. This is the first motivation for this paper. Online pairwise learning depends upon a sequence of samples $\mathbf{z}=\{z_t\}_{t=1}^T=\{(x_t,y_t)\}_{t=1}^T$. At each time step $t=2,\ldots,T$, the algorithm posits a hypothesis $f_t\in\mathcal{H}_K$, upon which the next sample $z_t$ is revealed, and incurs the following local regularized empirical error, on which the quality of the hypothesis $f_t$ is assessed (see, e.g., [15,72]):
$$\mathcal{E}_t^{\lambda}(f)=\frac{1}{t-1}\sum_{j=1}^{t-1}V\big(f(x_t,x_j)-\delta(y_t,y_j)\big)+\frac{\lambda}{2}\|f\|_K^2,$$
where λ > 0 is a regularization parameter.
To update the current predictor $f_t$, an iterative step is made along the negative gradient of the above local regularized empirical error $\mathcal{E}_t^{\lambda}(f)$ at $f=f_t$, which is given explicitly by (see [42,46])
$$\nabla\mathcal{E}_t^{\lambda}(f)\big|_{f=f_t}=\frac{1}{t-1}\sum_{j=1}^{t-1}V'\big(f_t(x_t,x_j)-\delta(y_t,y_j)\big)K_{(x_t,x_j)}(\cdot)+\lambda f_t.$$
Using this gradient descent method, the general form of the online kernel-based regularized pairwise learning algorithm is defined as
$$f_1=f_2=0,\qquad f_{t+1}=f_t-\eta_t\Big[\frac{1}{t-1}\sum_{j=1}^{t-1}V'\big(f_t(x_t,x_j)-\delta(y_t,y_j)\big)K_{(x_t,x_j)}(\cdot)+\lambda f_t\Big],\quad t=2,\ldots,T.$$

For a hypothesis $f$, we define the generalization error or risk associated with the loss function $V$ as
$$\mathcal{E}(f)=\int_Z\int_Z V\big(f(x,x')-\delta(y,y')\big)\,d\rho(x,y)\,d\rho(x',y'),$$
and the regularized generalization error or risk is defined as
$$\mathcal{E}^{\lambda}(f)=\mathcal{E}(f)+\frac{\lambda}{2}\|f\|_K^2.$$
To study the learning performance of online pairwise learning algorithms, we need to bound the convergence rate of the iterative sequence $\{f_t: t=1,\cdots,T+1\}$. Online pairwise learning in a linear space was investigated in [64], where generalization bounds for the average of the iterates were established under the uniform boundedness of the loss function. Instead of the average of the iterates, recent works consider the last iterate of online pairwise learning algorithms, and the results show that the performance of the last iterate is competitive with that of the average of the iterates ([15,26,39,72]). [72] studied the convergence rates of online pairwise learning with the least squares loss; a novel error decomposition was given and explicit error bounds of order $O(T^{-\frac{\beta}{2\beta+2}})$ were presented when the pairwise regression function satisfies $f_\rho\in L_K^{\beta}(L_\rho^2)$. The convergence rate of the online pairwise learning algorithm for binary classification was established in [26] by using strong convexity and re-weighting techniques. [39] investigated unregularized online pairwise learning algorithms with a general convex loss satisfying an increment condition; the convergence rate $O(T^{-\frac{\beta}{2\beta+1}}\log T)$ was established with the step size $\eta_t=\eta_1 t^{-\theta}$, $\theta=\frac{2\beta+1}{2\beta+2}$. It is known that the learning rates of a kernel-based regularization algorithm are influenced by geometric properties of the hypothesis space, e.g., the capacity, the covering number, and the uniform convexity ([4,5,20,40]). Some other parameters related to the RKHS also influence the learning rates. For example, it is shown in [37,65,68,70] that the flexible variance $\sigma$ in Gaussian kernels greatly influences the learning rates. Recent research on semi-supervised learning shows that the parameters in the loss function also influence the learning rates (see [54]). [63] studied the convergence of the online pairwise algorithm with varying regularization parameters.
On the other hand, the quadratic function $\sqrt{1+t^2}$, $t\in\mathbb{R}$, plays an important role in constructing shape-preserving quasi-interpolants and in solving partial differential equations with meshfree methods, owing to its strong nonlinearity and convexity (see [14,25,66,67]); it has also been used by [56] as a loss function, showing some advantages in forming semi-supervised learning algorithms. Encouraged by these researches, we investigate how such parameters influence the learning rates of online pairwise learning. This is the second reason for writing this paper.
The rest of this paper is organized as follows. In Section 2 we recall the theory of bivariate orthogonal polynomials and give some examples of bivariate pairwise Mercer kernels. In Section 3 we describe the online pairwise learning algorithm associated with the quadratic loss function $V_\sigma(r):=\sigma^2\big(\sqrt{1+(r/\sigma)^2}-1\big)$ with scale parameter $\sigma\in(0,1]$. Unlike pointwise learning, pairwise learning involves pairs of training samples that are not independently and identically distributed (i.i.d.). Using tools from convex analysis and Rademacher complexity, we analyze the performance of the learning algorithm in Section 4 and give an explicit convergence rate bound. The analysis shows that the learning rate can be improved by choosing the scale parameter $\sigma$ properly. The proofs are given in Section 5.
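For orientation, elementary calculus shows how the scale $\sigma$ enters: the loss behaves like the least squares loss near the origin, grows only linearly far from it, and has a derivative uniformly bounded by $\sigma$:
$$V_\sigma'(r)=\frac{r}{\sqrt{1+(r/\sigma)^2}},\qquad |V_\sigma'(r)|\le\sigma\quad\text{for all } r\in\mathbb{R},$$
$$V_\sigma(r)=\frac{r^2}{2}+O\Big(\frac{r^4}{\sigma^2}\Big)\ \ (|r|\ll\sigma),\qquad V_\sigma(r)=\sigma|r|-\sigma^2+O\Big(\frac{\sigma^3}{|r|}\Big)\ \ (|r|\gg\sigma).$$
Thus $V_\sigma$ is convex and $\sigma$-Lipschitz, which is one natural route by which the scale parameter can enter the error bounds.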
Here and later we write $\kappa:=\sup_{(x,x')\in X^2}\sqrt{K_{(x,x')}(x,x')}$.

2. Pairwise Mercer kernels and PRKHSs. Cucker and Smale gave a description of a usual RKHS in terms of orthonormal systems (see Proposition 4 of [19]; see also [51,60,61]), which itself is a way of constructing RKHSs. Along this line, we give here a method of constructing PRKHSs with bivariate orthogonal polynomials.
Let $\Omega\subset\mathbb{R}^2$ be a simply connected bounded domain (having a nonempty interior) and let $w(x,y)$ be a non-negative, nonzero integrable function defined on $\Omega$. Denote by $L^2(w)$ the space of measurable functions $f$ on $\Omega$ with $\int_\Omega|f(x,y)|^2 w(x,y)\,dxdy<+\infty$. For any $f,g\in L^2(w)$ we define the inner product
$$\langle f,g\rangle_w=\int_\Omega f(x,y)g(x,y)w(x,y)\,dxdy.$$
Denote by $\Pi_n^2$ the linear space of real bivariate polynomials of total degree at most $n$, i.e.,
$$\Pi_n^2=\mathrm{span}\big\{x^m y^l: m+l\le n\big\},$$
and by $\Pi^2$ the collection of all bivariate polynomials.
A polynomial $p\in\Pi_n^2$ is called an orthogonal polynomial with respect to $w(x,y)$ if
$$\langle p,q\rangle_w=\int_\Omega p(x,y)q(x,y)w(x,y)\,dxdy=0\quad\text{for all }q\in\Pi_{n-1}^2.$$
Let $\mathcal{V}_n^2$ denote the set of polynomials of exact degree $n$ that are orthogonal with respect to the weight function $w(x,y)$. Then by [34] we know that $\mathcal{V}_n^2$ is a linear space of polynomials of dimension $n+1$. Let $\{P_{n-k,k}\}_{k=0}^n$ denote a basis of $\mathcal{V}_n^2$. Then $\{P_{n-k,k}(x,y): (x,y)\in\Omega,\ 0\le k\le n,\ n=0,1,2,\cdots\}$ forms an orthogonal polynomial system (see [34]).
Take $\mathbb{P}_n=(P_{n-k,k})_{k=0}^n=(P_{n,0},P_{n-1,1},\cdots,P_{0,n})^T$, so that $\mathbb{P}_n^T$ is a row vector. With this notation, the orthogonality of $\mathbb{P}_n$ can be expressed as (see [24])
$$\int_\Omega \mathbb{P}_m(x,y)\,\mathbb{P}_n^T(x,y)\,w(x,y)\,dxdy=\begin{cases}H_n, & m=n,\\ O_{(m+1)\times(n+1)}, & m\neq n,\end{cases}$$
where $H_n$ is a symmetric and positive-definite matrix of size $(n+1)\times(n+1)$, and $O_{(m+1)\times(n+1)}$ is the $(m+1)\times(n+1)$ matrix all of whose elements are zero. If $H_n$ is the identity matrix, then $\{P_{n-k,k}: 0\le k\le n,\ n=0,1,2,\ldots\}$ is an orthonormal polynomial system (see [24]). We now provide a method for constructing a Mercer kernel $K_{(x,y)}(x',y')$ on $\Omega^2$ with orthogonal polynomials on $\Omega$.

Let $\{P_{n-k,k}(x,y): (x,y)\in\Omega,\ 0\le k\le n;\ n=0,1,2,\cdots\}$ be an orthogonal polynomial system with $h_{n,k}^2:=\langle P_{n-k,k},P_{n-k,k}\rangle_w$, and let $\{\lambda_n\}_{n\ge0}$ be positive numbers. Then
$$K_{(x,y)}(x',y')=\sum_{n=0}^{\infty}\lambda_n\sum_{k=0}^{n}h_{n,k}^{-2}P_{n-k,k}(x,y)\,P_{n-k,k}(x',y') \qquad (2.2)$$
is a pairwise Mercer kernel on $\Omega^2$ under the assumption that $\lambda_n$ decays fast enough that the series on the right side of (2.2) converges absolutely and uniformly; we assume
$$\sup_{(x,y)\in\Omega}\sum_{n=0}^{\infty}\lambda_n\sum_{k=0}^{n}h_{n,k}^{-2}|P_{n-k,k}(x,y)|^2<+\infty. \qquad (2.3)$$
Assuming the series converges uniformly on $\Omega$, we now show that $K_{(x,y)}(x',y')$ is a Mercer kernel on $\Omega^2$. In fact, by the Hölder inequality,
$$|K_{(x,y)}(x',y')|\le\Big(\sum_{n=0}^{\infty}\lambda_n\sum_{k=0}^{n}h_{n,k}^{-2}|P_{n-k,k}(x,y)|^2\Big)^{1/2}\Big(\sum_{n=0}^{\infty}\lambda_n\sum_{k=0}^{n}h_{n,k}^{-2}|P_{n-k,k}(x',y')|^2\Big)^{1/2}.$$
To verify (2.3), one needs to bound $\sum_{k=0}^{n}h_{n,k}^{-2}|P_{n-k,k}(x',y')|^2$, which itself is a problem touching upon the estimation of the Christoffel-Darboux formula (see, e.g., [22,28,32,49]).

Remark 1. The following two cases show that the assumption (2.3) is reasonable.
• When $\Omega$ is a compact set and $w(x,y)$ is a bounded function on $\Omega$, e.g. $0\le w(x,y)\le1$: since $\sum_{k=0}^{n}h_{n,k}^{-2}|P_{n-k,k}(x',y')|^2$ is a continuous function on $\Omega$ for a given $n$, it attains a maximum value $M_n$ on $\Omega$, and we can choose $\lambda_n>0$ such that $\sum_{n=0}^{\infty}\lambda_n M_n<+\infty$.
• When $\Omega$ is a compact set and $w(x,y)^{-1}$ is a bounded function on $\Omega$, e.g. $0\le w(x,y)^{-1}\le1$: for given $n$ and $k$, the term $h_{n,k}^{-2}|P_{n-k,k}(x',y')|^2$ is again bounded on $\Omega$, so the sum attains a finite maximum $M_n$, and we can choose $\lambda_n>0$ such that $\sum_{n=0}^{\infty}\lambda_n M_n<+\infty$.
Take $r_{n,k}(f)=\langle f,h_{n,k}^{-1}P_{n-k,k}\rangle_w$, $k=0,1,2,\ldots,n$, $n=0,1,2,\ldots$, to be the Fourier coefficients of $f$ with respect to the orthonormal system $\{h_{n,k}^{-1}P_{n-k,k}\}$ of $L^2(w)$. Define
$$\mathcal{H}_K=\Big\{f\in L^2(w): \|f\|_K^2:=\sum_{n=0}^{\infty}\frac{1}{\lambda_n}\sum_{k=0}^{n}|r_{n,k}(f)|^2<+\infty\Big\},\qquad \langle f,g\rangle_K=\sum_{n=0}^{\infty}\frac{1}{\lambda_n}\sum_{k=0}^{n}r_{n,k}(f)\,r_{n,k}(g). \qquad (2.4)$$
Then we have the following proposition.
Therefore, for any given $(x,y)\in\Omega$ we have $K_{(x,y)}(x',y')\in\mathcal{H}_K$. By (2.4) we know that for any $f\in\mathcal{H}_K$ and $(x,y)\in\Omega$,
$$\langle f, K_{(x,y)}\rangle_K=\sum_{n=0}^{\infty}\frac{1}{\lambda_n}\sum_{k=0}^{n}r_{n,k}(f)\,\lambda_n h_{n,k}^{-1}P_{n-k,k}(x,y)=\sum_{n=0}^{\infty}\sum_{k=0}^{n}r_{n,k}(f)\,h_{n,k}^{-1}P_{n-k,k}(x,y)=f(x,y).$$
So $\mathcal{H}_K$ is a PRKHS with reproducing kernel $K_{(x,y)}(x',y')$.
2.1. Some pairwise Mercer kernels. By Proposition 1 we know that to construct a PRKHS, one only needs to construct pairwise Mercer kernels $K_{(x,y)}(x',y')$ satisfying (2.3). We give some examples. Take the classical Jacobi polynomials $P_n^{(\alpha,\beta)}(x)$, $\alpha,\beta>-1$, which are orthogonal on $[-1,1]$ with respect to the weight $(1-x)^{\alpha}(1+x)^{\beta}$ and whose normalization constants are expressed via Euler's Gamma function $\Gamma(\lambda)=\int_0^{+\infty}t^{\lambda-1}e^{-t}\,dt$, $\lambda>0$.
(1) A bivariate pairwise Mercer kernel on the unit disk.
(2) A bivariate pairwise Mercer kernel over the triangle.
(4) A bivariate pairwise Mercer kernel over the square.
According to the results of [36], if $\{p_n(x)\}_{n=0}^{+\infty}$ and $\{q_n(x)\}_{n=0}^{+\infty}$ are orthogonal polynomial systems in one variable relative to $d\mu(x)$ and $d\nu(x)$ respectively, then the product polynomials $\{p_{n-k}(x)q_k(y): 0\le k\le n,\ n=0,1,2,\cdots\}$ form an orthogonal polynomial system relative to the product measure $d\mu(x)d\nu(y)$. For the square $\Omega=[-1,1]^2$ there are the Koornwinder orthogonal polynomials (see [24,44])
$$P_{n-k,k}^{(\alpha,\beta),(\gamma,\delta)}(x,y)=P_{n-k}^{(\alpha,\beta)}(x)\,P_{k}^{(\gamma,\delta)}(y),\qquad 0\le k\le n,\ n=0,1,2,\cdots,$$
which are orthogonal with respect to the product weight $w(x,y)=(1-x)^{\alpha}(1+x)^{\beta}(1-y)^{\gamma}(1+y)^{\delta}$. For $(x,y),(x',y')\in\Omega$, we may define a bivariate pairwise Mercer kernel as
$$K_{(x,y)}(x',y')=\sum_{n=0}^{\infty}\lambda_n\sum_{k=0}^{n}h_{n,k}^{-2}P_{n-k,k}^{(\alpha,\beta),(\gamma,\delta)}(x,y)\,P_{n-k,k}^{(\alpha,\beta),(\gamma,\delta)}(x',y'),$$
where $h_{n,k}^2=\langle P_{n-k,k}^{(\alpha,\beta),(\gamma,\delta)},P_{n-k,k}^{(\alpha,\beta),(\gamma,\delta)}\rangle_w$. The $\lambda_n$ are chosen such that the series converges uniformly on $\Omega$.
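As a concrete illustration of this construction, the following minimal numerical sketch evaluates a truncated version of such a product-Jacobi kernel on the square $[-1,1]^2$; the truncation level $N$, the geometric weights $\lambda_n=q^n$, and the use of equal Jacobi parameters for both factors are assumptions made only for this example.

```python
import numpy as np
from scipy.special import eval_jacobi, gamma

def jacobi_norm_sq(n, a, b):
    """Squared weighted L2 norm of the Jacobi polynomial P_n^{(a,b)} on [-1, 1]
    with weight (1-x)^a (1+x)^b."""
    return (2.0 ** (a + b + 1) / (2 * n + a + b + 1)
            * gamma(n + a + 1) * gamma(n + b + 1)
            / (gamma(n + a + b + 1) * gamma(n + 1)))

def pairwise_kernel(x, y, xp, yp, a=0.0, b=0.0, N=15, q=0.5):
    """Truncated K_{(x,y)}(x',y') = sum_{n<=N} lambda_n sum_{k<=n}
    h_{n,k}^{-2} P_{n-k,k}(x,y) P_{n-k,k}(x',y'), where
    P_{n-k,k}(x,y) = P_{n-k}^{(a,b)}(x) P_k^{(a,b)}(y) and lambda_n = q^n."""
    val = 0.0
    for n in range(N + 1):
        lam = q ** n
        for k in range(n + 1):
            h2 = jacobi_norm_sq(n - k, a, b) * jacobi_norm_sq(k, a, b)
            p = eval_jacobi(n - k, a, b, x) * eval_jacobi(k, a, b, y)
            pp = eval_jacobi(n - k, a, b, xp) * eval_jacobi(k, a, b, yp)
            val += lam * p * pp / h2
    return val

# symmetry in the pair arguments: K_{(x,y)}(x',y') = K_{(x',y')}(x,y)
print(pairwise_kernel(0.1, -0.3, 0.7, 0.2), pairwise_kernel(0.7, 0.2, 0.1, -0.3))
```

The printed values agree, reflecting the symmetry of the kernel in its two pair arguments, which the construction guarantees by design.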

2.2. Symmetric and antisymmetric pairwise Mercer kernels. In this subsection, we give two methods for constructing symmetric and antisymmetric pairwise Mercer kernels with orthogonal polynomial systems.

3. Online pairwise learning algorithm with quadratic loss function.
For any predictor $f\in\mathcal{H}_K$ and $z=(x,y)\in Z$, $z'=(x',y')\in Z$, the quadratic pairwise loss is defined as $V_\sigma\big(f(x,x')-\delta(y,y')\big)$ with
$$V_\sigma(r):=\sigma^2\Big(\sqrt{1+(r/\sigma)^2}-1\Big),\qquad r\in\mathbb{R},$$
where $\sigma\in(0,1]$ is a scale parameter, and $\sup_{y,y'\in Y}|\delta(y,y')|\le M$ for some constant $M>0$. Some properties of the quadratic pairwise loss function are given below.
Proposition 2. For any $f,g\in\mathcal{H}_K$, there holds
$$\big|V_\sigma\big(f(x,x')-\delta(y,y')\big)-V_\sigma\big(g(x,x')-\delta(y,y')\big)\big|\le\sigma\,|f(x,x')-g(x,x')|.$$
Proof. By the mean value theorem, for any $f,g\in\mathcal{H}_K$ there exists $\xi$ between $f(x,x')-\delta(y,y')$ and $g(x,x')-\delta(y,y')$ such that
$$V_\sigma\big(f(x,x')-\delta(y,y')\big)-V_\sigma\big(g(x,x')-\delta(y,y')\big)=V_\sigma'(\xi)\big(f(x,x')-g(x,x')\big).$$
Since for any $\xi\in\mathbb{R}$ we have $\sqrt{1+|\xi/\sigma|^2}\ge|\xi/\sigma|$, it follows that
$$|V_\sigma'(\xi)|=\frac{|\xi|}{\sqrt{1+(\xi/\sigma)^2}}\le\sigma,$$
which gives the desired bound.
Take $|V_\sigma|_0:=\sup_{y,y'\in Y}V_\sigma\big(\delta(y,y')\big)$. It is easy to see that $|V_\sigma|_0\le\sigma M$. For a hypothesis $f\in\mathcal{H}_K$, the local regularized empirical error with the quadratic loss function is
$$\mathcal{E}^t_{\lambda,\sigma}(f)=\frac{1}{t-1}\sum_{j=1}^{t-1}V_\sigma\big(f(x_t,x_j)-\delta(y_t,y_j)\big)+\frac{\lambda}{2}\|f\|_K^2,$$
and the gradient of the local regularized empirical error $\mathcal{E}^t_{\lambda,\sigma}(f)$ at $f=f_t$ is explicitly given by
$$\nabla\mathcal{E}^t_{\lambda,\sigma}(f)\big|_{f=f_t}=\frac{1}{t-1}\sum_{j=1}^{t-1}V_\sigma'\big(f_t(x_t,x_j)-\delta(y_t,y_j)\big)K_{(x_t,x_j)}(\cdot)+\lambda f_t.$$
At the current iteration point $f_t$, updating the predictor along the negative gradient direction $-\nabla\mathcal{E}^t_{\lambda,\sigma}(f)|_{f=f_t}$, the online pairwise learning algorithm with quadratic loss function $V_\sigma(r)$ is given as
$$f_1=f_2=0,\qquad f_{t+1}=f_t-\eta_t\Big[\frac{1}{t-1}\sum_{j=1}^{t-1}V_\sigma'\big(f_t(x_t,x_j)-\delta(y_t,y_j)\big)K_{(x_t,x_j)}(\cdot)+\lambda f_t\Big],\quad t=2,\ldots,T. \qquad (3.1)$$
For any predictor $f$ and $z=(x,y)\in Z$, $z'=(x',y')\in Z$, define the corresponding generalization error by
$$\mathcal{E}_\sigma(f)=\int_Z\int_Z V_\sigma\big(f(x,x')-\delta(y,y')\big)\,d\rho(z)\,d\rho(z'),$$
and
$$f_{\lambda,\sigma}=\arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}_\sigma(f)+\frac{\lambda}{2}\|f\|_K^2\Big\}. \qquad (3.4)$$
With these notions in hand, we can give the performance analysis for the algorithm (3.1).
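To make (3.1) concrete, the following minimal Python sketch runs the iteration with $V_\sigma'(r)=r/\sqrt{1+(r/\sigma)^2}$; the pairwise kernel (a Gaussian stand-in), the choice $\delta(y,y')=y-y'$, the step sizes $\eta_t=t^{-\theta}/\lambda$ and the toy data are illustrative assumptions, not prescriptions from this paper. The iterate $f_t$ is stored through the coefficients of the kernel sections $K_{(x_t,x_j)}$ accumulated so far.

```python
import numpy as np

def K_pair(u, up):
    """Illustrative pairwise Mercer kernel on pairs u=(x1,x2), up=(x1',x2');
    a Gaussian kernel on R^2 is used here purely as a stand-in."""
    return np.exp(-np.sum((np.asarray(u) - np.asarray(up)) ** 2) / 2.0)

def Vp_sigma(r, sigma):
    """Derivative of the quadratic loss V_sigma(r) = sigma^2 (sqrt(1+(r/sigma)^2) - 1)."""
    return r / np.sqrt(1.0 + (r / sigma) ** 2)

def online_pairwise(X, y, sigma=0.5, lam=0.1, theta=0.5):
    """Algorithm (3.1): f_1 = f_2 = 0 and, for each new sample x_t,
    f_{t+1} = f_t - eta_t [ (1/(t-1)) sum_j V'(f_t(x_t,x_j) - delta(y_t,y_j)) K_{(x_t,x_j)} + lam f_t ].
    Here delta(y, y') = y - y' and eta_t = t^{-theta} / lam are illustrative choices.
    f_t is stored as sum_i coeffs[i] * K_pair(pairs[i], .)."""
    T = len(y)
    pairs, coeffs = [], []
    for t in range(1, T):                # 0-based sample index t, i.e. step t+1
        eta = (t + 1.0) ** (-theta) / lam
        new_pairs, new_coeffs = [], []
        for j in range(t):
            u = (X[t], X[j])
            f_val = sum(c * K_pair(p, u) for p, c in zip(pairs, coeffs))
            r = f_val - (y[t] - y[j])    # f_t(x_t, x_j) - delta(y_t, y_j)
            new_pairs.append(u)
            new_coeffs.append(-eta * Vp_sigma(r, sigma) / t)
        coeffs = [c * (1.0 - eta * lam) for c in coeffs] + new_coeffs
        pairs = pairs + new_pairs
    return pairs, coeffs

# toy usage
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=20)
y = X ** 2 + 0.05 * rng.standard_normal(20)
pairs, coeffs = online_pairwise(X, y)
```

The rescaling of the stored coefficients by $(1-\eta_t\lambda)$ realizes the $-\eta_t\lambda f_t$ part of the update, while the appended coefficients realize the averaged loss-gradient part.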
4. The Performance. In this section, we present our main results on the performance of algorithm (3.1) developed in Section 3. Proofs are given in Section 5.2.
4.1. Convergence analysis. Now we provide a theorem which shows that, under some conditions on the step sizes, the last iterate generated by algorithm (3.1) converges to the optimal function $f_{\lambda,\sigma}$ stated in (3.4). In this paper, we assume $T\ge4$.
Theorem 4.2. Let $\lambda\in(0,1]$ and let $\{f_t\}_{t=2}^{T+1}$ be the function sequence generated by algorithm (3.1). If the step sizes are chosen as $\eta_t=\frac{1}{\lambda}t^{-\theta}$ with $\theta\in(0,1)$, then an explicit bound for $\mathbb{E}\|f_{T+1}-f_{\lambda,\sigma}\|_K^2$ holds; a corresponding bound also holds for the step size choice $\eta_t=\frac{1}{\lambda t}$.
The theorems above mainly describe the convergence rate of $\|f_{T+1}-f_{\lambda,\sigma}\|_K$, which is usually referred to as the sample error. However, in studying the learning performance of learning algorithms, we are often interested in the excess generalization error $\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)$ (see [20,26]). Define the quantity which is often used to denote the approximation error, whose convergence is determined by the capacity of $\mathcal{H}_K$. By combining the sample error and the approximation error, we obtain the overall learning rate stated as follows.
where $0<\beta\le1$. In fact, under the assumption (4.2), we have the following corollary, which follows directly from Theorem 4.3.

Further discussions.
We now give some discussions on the main results of the paper.
• In this paper, we analyze the convergence rate of the kernel-based regularized online pairwise learning algorithm with a quadratic loss function. The results show that the scale parameter $\sigma$ can effectively control the convergence rate of the learning algorithm. Depending on the circumstances, the learning rates can be greatly improved by choosing the parameter $\sigma$ properly, and they are better than existing error bounds.
In fact, taking $\sigma=\lambda$ shows that the parameter $\lambda$ may be removed.
• By the Lipschitz property of $V_\sigma$, we know there exists a constant $C>0$ such that the $K$-functional $K(f_\sigma,\lambda)$ satisfies the required bound. So the assumption (4.2) is reasonable if $\mathcal{H}_K$ is dense in $L^2(\rho_X)$ (see, e.g., [57]).
• We establish the error bounds for the learning sequence using convex analysis and Rademacher complexity, while previous analyses rely on integral operator theory [72] or a covering number approach [64]. Moreover, the results in [72] are given in probability while ours are obtained in expectation. Since the quadratic loss function is not strongly convex, we have relaxed the strong convexity assumption used in the literature [26]. Our method may also be extended to some online pairwise learning algorithms with non-convex loss functions, e.g., the robust loss function in [62].
• For the special case $V\big(f(x,x')-\delta(y,y')\big)=\big(y-y'-f(x,x')\big)^2$, [72] studied the relations between the pairwise Mercer kernel and the pointwise Mercer kernel. We consider the more general situation in this paper. [45] provided a way to construct pairwise kernels by considering a projection from the set of all kernels onto the set of permutation-invariant kernels. Unlike [45], we provide a method of constructing pairwise Mercer kernels with classical bivariate Jacobi orthogonal polynomials.

5. Proofs. In this section, we first establish some important lemmas, and then give the proofs of the main results. We use $\mathbb{E}_z(\cdot)$ to denote the expectation with respect to $z$. When the underlying random variable in the expectation is clear from the context, we simply write $\mathbb{E}(\cdot)$.

5.1. Some lemmas. To prove the results in Section 4, we need the following lemmas.
(ii) If $F(f)$ is a differentiable and convex function, then $F(f)$ attains its minimal value at $f^*$ if and only if $\nabla_f F(f^*)=0$.
(iii) For any $f,g\in\mathcal{H}$ there holds
The following Lemma 5.2 gives some properties of the generalization error $\mathcal{E}_\sigma(f)$.
Proof. We first prove (i). For any $f,g\in\mathcal{H}_K$, simple computations together with the Taylor formula yield an expansion of $\mathcal{E}_\sigma(f+g)-\mathcal{E}_\sigma(f)$ as a double integral with respect to $d\rho(x,y)\,d\rho(x',y')$.
Using the reproducing property $g(x,x')=\langle K_{(x,x')}(\cdot,\cdot),g\rangle_K$, we obtain the stated formula, and (i) is proved. Now we prove (ii). For arbitrary $u,v\in\mathbb{R}$, by the second-order Taylor expansion of $V_\sigma(r)$ at $r=v$, there exists $\xi$ between $u$ and $v$ such that
$$V_\sigma(u)=V_\sigma(v)+V_\sigma'(v)(u-v)+\frac{1}{2}V_\sigma''(\xi)(u-v)^2\ \ge\ V_\sigma(v)+V_\sigma'(v)(u-v),$$
since $V_\sigma''(\xi)=\big(1+(\xi/\sigma)^2\big)^{-3/2}\ge0$. Using Lemma 5.1 (i), it follows that $\mathcal{E}_\sigma(f)$ is a convex function on $\mathcal{H}_K$. This gives the desired result.
The following lemma shows that the function sequence $\{f_t\}$ is bounded.
Lemma 5.3. Let $\lambda>0$ and let $\{f_t\}$ be the function sequence generated by algorithm (3.1). If the step sizes $\{\eta_t\}$ satisfy $\eta_t\lambda\le1$ for all $t\ge2$, then for every $t\in\mathbb{N}$ there holds
$$\|f_t\|_K\le\frac{\kappa\sigma}{\lambda}. \qquad (5.5)$$
Proof. We prove this conclusion by induction. The initial functions $f_1=f_2=0$ certainly satisfy (5.5). Assuming $f_t$ satisfies the inequality (5.5), we now prove that $f_{t+1}$ also satisfies it. Since
$$f_{t+1}=(1-\eta_t\lambda)f_t-\frac{\eta_t}{t-1}\sum_{j=1}^{t-1}V_\sigma'\big(f_t(x_t,x_j)-\delta(y_t,y_j)\big)K_{(x_t,x_j)}(\cdot),$$
we have
$$\|f_{t+1}\|_K\le(1-\eta_t\lambda)\|f_t\|_K+\eta_t\kappa\sigma.$$
By the assumption $\|f_t\|_K\le\frac{\kappa\sigma}{\lambda}$ and $\eta_t\lambda\le1$ we get
$$\|f_{t+1}\|_K\le(1-\eta_t\lambda)\frac{\kappa\sigma}{\lambda}+\eta_t\kappa\sigma=\frac{\kappa\sigma}{\lambda}.$$
This completes the proof of the lemma.
Lemma 5.4. Let $\lambda>0$ and let $\{f_t\}$ be the function sequence generated by algorithm (3.1). If the step sizes $\{\eta_t\}$ satisfy $\eta_t\lambda\le1$ for all $t\ge2$, then there holds the following inequality.
Proof. We first estimate the term $A$. From (5.6) and Lemma 5.3, one can see that
By the definition of $A_t^{\lambda,\sigma}$ and the reproducing property, we know that
The convexity of $V_\sigma(r)$, $r\in\mathbb{R}$, implies that
Combining (5.7), (5.8) and (5.9), we obtain the desired bound. This completes the proof of Lemma 5.4.
To prove the main results, we need the concept of Rademacher complexity and its important property.
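For the reader's convenience we recall the standard notion (the precise variant and normalization used in the original lemma may differ): for a class $\mathcal{F}$ of real-valued functions and points $u_1,\ldots,u_n$, the empirical Rademacher complexity is
$$\widehat{\mathcal{R}}_n(\mathcal{F})=\mathbb{E}_{\varepsilon}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(u_i)\Big],$$
where $\varepsilon_1,\ldots,\varepsilon_n$ are independent Rademacher variables taking the values $\pm1$ with probability $1/2$ each. A frequently used property, presumably the one intended here, is the Lipschitz contraction inequality: if $\phi$ is $L$-Lipschitz, then $\widehat{\mathcal{R}}_n(\phi\circ\mathcal{F})\le L\,\widehat{\mathcal{R}}_n(\mathcal{F})$.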
Proof of Theorem 4.1. Since $\eta_t\to0$ as $t\to+\infty$, there exists $t_0\in\mathbb{N}$ such that $\eta_t\le\frac{1}{2\sqrt{6}\,\lambda\kappa\sigma}$ for any $t\ge t_0$. For this fixed $t_0$, Proposition 3 implies that it suffices to estimate the three terms on the right side of (5.25) respectively.
(i) Bound the first term.