KERNEL-BASED ONLINE GRADIENT DESCENT USING DISTRIBUTED APPROACH

Abstract. In this paper we study kernel-based online gradient descent with the least squares loss and without an explicit regularization term. Our approach is novel in that it controls the expectation of the K-norm of f_t through an iterative process. We then use distributed learning to improve the result.


1. Introduction. Unlike classical batch learning, which learns from an entire data set at once, online learning learns from a data set of increasing size. Gradient descent is a powerful method for finding the minimizer of a function, and online gradient descent is its adaptation to the online setting. Such stochastic approximation procedures date back to [8, 5], and the online gradient descent algorithm has been studied recently in [9, 15]. In [14], an early stopping approach for batch learning is studied. In [9], the author studied an online gradient descent algorithm with a regularization term λf_t, which can be formulated as
\[
f_{t+1} = f_t - \eta_t \big( (f_t(x_t) - y_t) K_{x_t} + \lambda f_t \big), \qquad t \in \mathbb{N}. \tag{1}
\]
We call λ the regularization parameter; when λ > 0, the algorithm is called online regularized learning and has been well studied in [7, 9, 16]. In [14], the regularization term is replaced by an early stopping rule, and in [15], the author studied (1) without an explicit regularization term (i.e. λ = 0). Our algorithm is the same as that in [15], but we prove the risk bound by a novel method that involves an iterative process. In [15], the constant µ must be sufficiently large for the proof to work; the size of µ should, however, not be constrained merely as a proof technique. In our approach, µ admits a much looser definition.
Recently, researchers have become interested in online algorithms in various settings. Dropping the identically-distributed assumption while retaining independence, [10] studied online learning with Markov sampling. In [4], an online regression algorithm with Gaussian kernels of changing variance is introduced and analysed. For the situation of unbounded sampling, an online minimum error entropy algorithm was proposed in [12]. In industrial applications, new algorithms have been proposed, such as [11, 3, 13].
Data sets of unprecedented size and complexity are now common, raising problems such as storage bottlenecks and algorithm scalability. To overcome these challenges, distributed approaches based on the divide-and-conquer strategy have been employed [17, 6]. In this paper, we use a distributed approach to reduce the variance produced by the noise and obtain a better result. In [6], the authors use a second order decomposition to estimate the learning error; our approach is simpler and does not involve a second order decomposition.
Let X be a compact subset of the Euclidean space R^d and Y a subset of R. Define Z = X × Y. Let ρ be an unknown probability measure on Z and ρ(· | x) the conditional probability measure on Y for a fixed x ∈ X. We define the generalization error of a function f : X → Y as
\[
\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho .
\]
The regression function is defined as
\[
f_\rho(x) = \int_Y y \, d\rho(y \mid x),
\]
which minimizes the generalization error over all measurable functions. Our goal is to approximate the regression function from a sample Z = {z_1, ..., z_i, ...} drawn independently from the unknown probability measure ρ. We set our learning scheme in a Reproducing Kernel Hilbert Space (RKHS), where the regularity of a function is characterized by the integral operator.
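For completeness, the minimizing property of f_ρ is the standard identity
```latex
\mathcal{E}(f) - \mathcal{E}(f_\rho)
   = \int_Z \Big[ (f(x)-y)^2 - (f_\rho(x)-y)^2 \Big] \, d\rho
   = \int_X \big( f(x) - f_\rho(x) \big)^2 \, d\rho_X ,
```
where ρ_X denotes the marginal of ρ on X; the cross term vanishes because ∫_Y (y − f_ρ(x)) dρ(y | x) = 0 for each x. In particular, E(f) ≥ E(f_ρ) for every measurable f.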
A function K : X × X → R is called a Mercer kernel when it is continuous, symmetric and positive semidefinite [2]; here positive semidefinite means that for any finite set of points {x_1, ..., x_n} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^n is positive semidefinite. The closure of all linear combinations of {K_x := K(x, ·) : x ∈ X} under the following inner product forms the Reproducing Kernel Hilbert Space induced by the kernel K. For f(·) = \sum_{i=1}^n c_i K(x_i, ·) and g(·) = \sum_{j=1}^m d_j K(x_j, ·), we define the inner product ⟨·, ·⟩_K as
\[
\langle f, g \rangle_K = \sum_{i=1}^{n} \sum_{j=1}^{m} c_i d_j K(x_i, x_j).
\]

One of the important properties of an RKHS is the reproducing property, which is characterized by
\[
f(x) = \langle f, K_x \rangle_K, \qquad \forall f \in \mathcal{H}_K, \ x \in X.
\]
In particular, with κ := \sup_{x \in X} \sqrt{K(x, x)} < ∞, the reproducing property and the Cauchy–Schwarz inequality give ‖f‖_∞ ≤ κ‖f‖_K.
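As a concrete numerical illustration of this inner product and the reproducing property, consider the following minimal sketch with a Gaussian kernel (the function names `gaussian_kernel` and `rkhs_inner` are our own, not the paper's):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # A Mercer kernel on R: continuous, symmetric, positive semidefinite.
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def rkhs_inner(c, xs, d, ys, kernel=gaussian_kernel):
    # <f, g>_K = sum_i sum_j c_i d_j K(x_i, y_j) for the kernel expansions
    # f = sum_i c_i K_{x_i} and g = sum_j d_j K_{y_j}.
    return sum(ci * dj * kernel(xi, yj)
               for ci, xi in zip(c, xs) for dj, yj in zip(d, ys))

# f = K_{0.0} - 2 K_{0.5} + 0.5 K_{1.0}
xs, c = [0.0, 0.5, 1.0], [1.0, -2.0, 0.5]
f = lambda x: sum(ci * gaussian_kernel(xi, x) for ci, xi in zip(c, xs))

# Reproducing property: f(x) = <f, K_x>_K
x0 = 0.3
lhs = f(x0)
rhs = rkhs_inner(c, xs, [1.0], [x0])  # g = K_{x0}
assert abs(lhs - rhs) < 1e-9
```

The same routine also confirms that ⟨f, f⟩_K ≥ 0, as positive semidefiniteness of the kernel requires.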
The online gradient descent algorithm is defined in the following way:
\[
f_1 = 0, \qquad f_{t+1} = f_t - \eta_t \big( f_t(x_t) - y_t \big) K_{x_t}, \qquad t \in \mathbb{N}. \tag{2}
\]
This is the instance of (1) obtained by setting λ = 0. The sequence {η_t : t ∈ N} is called the step size or learning rate. Our machine receives a sequence of data {z_t : t ∈ N} one at a time, where z_t = (x_t, y_t), and the data z_t are drawn independently. In distributed learning we divide our source of data into J different subsets and run the online gradient descent algorithm on each subset; the output function after the t-th iteration on the j-th machine is denoted by f^j_t. The algorithm can be formulated as follows:
\[
f^j_1 = 0, \qquad f^j_{t+1} = f^j_t - \eta_t \big( f^j_t(x^j_t) - y^j_t \big) K_{x^j_t}, \qquad \bar{f}_t = \frac{1}{J} \sum_{j=1}^{J} f^j_t. \tag{3}
\]

2. Main result. By using an iterative process similar to [1], we can prove that, for a sufficiently large constant C, the expectation of the K-norm of f_t is bounded by C uniformly in t.
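To make the update (2) concrete, here is a minimal Python sketch in which f_t is stored as a kernel expansion over the inputs seen so far. The Gaussian kernel, the target sin(x), and the step sizes η_t = t^{−θ}/µ with µ = 1, θ = 0.6 are illustrative choices of ours, not prescribed by the analysis:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))  # kappa^2 = K(x, x) = 1

def online_gradient_descent(data, eta, kernel=gaussian_kernel):
    # f_1 = 0;  f_{t+1} = f_t - eta(t) * (f_t(x_t) - y_t) * K_{x_t}   -- update (2).
    # f_t is kept as a kernel expansion: one center and one coefficient per sample.
    centers, coeffs = [], []
    f = lambda x: sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))
    for t, (x_t, y_t) in enumerate(data, start=1):
        residual = f(x_t) - y_t
        centers.append(x_t)
        coeffs.append(-eta(t) * residual)
    return f

# Learn f_rho(x) = sin(x) from noisy samples, with eta_t = t^{-theta} (mu = 1).
rng = np.random.default_rng(0)
xs = rng.uniform(-np.pi, np.pi, 200)
ys = np.sin(xs) + 0.1 * rng.standard_normal(200)
f = online_gradient_descent(zip(xs, ys), eta=lambda t: t ** -0.6)
```

Note that η_1 κ² = 1 here, so the step-size condition η_t κ² ≤ 1 used later in Lemma 3.1 is satisfied.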
Theorem 2.1. Let the online gradient descent algorithm be defined by (2); then E‖f_t‖_K ≤ C holds uniformly in t.

Proof. Notice that, by using the process in the proof given in Section 3 for the exponent (1−θ)/(2θ−1), the result is derived.
With this estimation, we can prove the following theorem.
Theorem 2.2. Let the online gradient descent algorithm be defined in (2); then, after t iterations, we have the stated risk bound.

By using distributed online gradient descent, the following result is derived.
Theorem 2.3. Let f^j_t be the function obtained after the t-th iteration on the j-th machine by the distributed online gradient descent algorithm (3), with η_t = (1/µ) t^{−θ} and 1 ≤ J ≤ t^{(2β−1)(1−θ)}. Then the following holds, where C is a constant independent of J and t.
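The divide-and-conquer scheme (3) can be sketched similarly: each of the J machines runs the update (2) on its own disjoint subset of the sample, and the J local outputs are averaged. As before, the Gaussian kernel, the target and the step sizes are our own illustrative choices:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def local_ogd(chunk, eta):
    # One machine: f^j_1 = 0; f^j_{t+1} = f^j_t - eta(t) (f^j_t(x_t) - y_t) K_{x_t}.
    centers, coeffs = [], []
    f = lambda x: sum(c * gaussian_kernel(xc, x) for c, xc in zip(coeffs, centers))
    for t, (x_t, y_t) in enumerate(chunk, start=1):
        residual = f(x_t) - y_t
        centers.append(x_t)
        coeffs.append(-eta(t) * residual)
    return f

def distributed_ogd(samples, J, eta):
    # Divide the sample into J disjoint subsets and average the local outputs.
    chunks = np.array_split(np.asarray(samples, dtype=float), J)
    machines = [local_ogd(chunk, eta) for chunk in chunks]
    return lambda x: np.mean([f(x) for f in machines])

rng = np.random.default_rng(1)
xs = rng.uniform(-np.pi, np.pi, 300)
ys = np.sin(xs) + 0.1 * rng.standard_normal(300)
f_bar = distributed_ogd(list(zip(xs, ys)), J=3, eta=lambda t: t ** -0.6)
```

Averaging the independent local estimators is what reduces the noise-induced variance in the final output, in the spirit of the result above.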

XIAMING CHEN
3. Proof of main result. First, we prove a bound for the K-norm of the iterate f_t.
Lemma 3.1. Let the learning sequence {f_t : t ∈ N} be given by (2). Assume |y| ≤ M almost surely and η_t κ² ≤ 1 for every t ∈ N. Then the stated bound on ‖f_t‖_K holds.

Proof. For t = 1, f_1 = 0 clearly satisfies the stated bound. For t > 1, the claim follows from (2); taking square roots then verifies the desired bound.
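The lemma can be checked empirically: with |y| ≤ M and η_t κ² ≤ 1, the K-norm of the iterates stays bounded along the run. The sketch below is our own illustrative setup; it tracks ‖f_t‖²_K = cᵀGc via the Gram matrix G and does not reproduce the exact constant of the lemma:

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel: kappa = sup_x sqrt(K(x, x)) = 1.
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
M = 1.0
xs = rng.uniform(-1.0, 1.0, 100)
ys = np.clip(np.sin(3 * xs) + 0.05 * rng.standard_normal(100), -M, M)  # |y| <= M

centers, coeffs, norms = [], [], []
for t, (x_t, y_t) in enumerate(zip(xs, ys), start=1):
    eta_t = t ** -0.6                      # eta_t * kappa^2 = eta_t <= 1
    f_xt = sum(c * k(xc, x_t) for c, xc in zip(coeffs, centers))
    centers.append(x_t)
    coeffs.append(-eta_t * (f_xt - y_t))   # update (2)
    G = np.array([[k(a, b) for b in centers] for a in centers])
    c = np.array(coeffs)
    norms.append(float(np.sqrt(max(c @ G @ c, 0.0))))  # ||f_{t+1}||_K
```

In this run the recorded norms remain of moderate size rather than growing with t, consistent with the uniform control the lemma provides.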

Lemma 3.2. We have the following identity, where f_H is defined in (4).
Proof. The functional E(·) is differentiable and convex; hence E(·) achieves its minimum where its gradient vanishes, and for g ∈ H_K the claimed identity follows.

Proposition 1. The operator L_t on H_K defined by L_t f = ⟨f, K_{x_t}⟩_{H_K} K_{x_t} has operator norm bounded by ‖L_t‖ ≤ κ².
Proof. For f ∈ H_K, the Cauchy–Schwarz inequality gives
\[
\| L_t f \|_K^2 = \langle f, K_{x_t} \rangle_{H_K}^2 \, \| K_{x_t} \|_K^2 \le \| f \|_K^2 \, \| K_{x_t} \|_K^4 \le \kappa^4 \| f \|_K^2 ,
\]
since ‖K_{x_t}‖²_K = K(x_t, x_t) ≤ κ². Hence ‖L_t‖² ≤ κ⁴ and the proposition is proved.
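The operator-norm bound can also be verified numerically: by the reproducing property, L_t f = f(x_t) K_{x_t}, so ‖L_t f‖_K = |f(x_t)| √(K(x_t, x_t)) ≤ κ² ‖f‖_K. A sketch with a Gaussian kernel, where κ = 1 (the setup is our own illustration):

```python
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

kappa = 1.0  # sup_x sqrt(K(x, x)) for the Gaussian kernel

# A generic element f = sum_i c_i K_{x_i} of H_K.
rng = np.random.default_rng(3)
xs = rng.uniform(-1.0, 1.0, 10)
c = rng.standard_normal(10)
G = np.array([[k(a, b) for b in xs] for a in xs])
norm_f = float(np.sqrt(max(c @ G @ c, 0.0)))  # ||f||_K

# L_t f = <f, K_{x_t}> K_{x_t} = f(x_t) K_{x_t}, hence
# ||L_t f||_K = |f(x_t)| * sqrt(K(x_t, x_t)) <= kappa^2 * ||f||_K.
for x_t in rng.uniform(-1.0, 1.0, 20):
    f_xt = sum(ci * k(xi, x_t) for ci, xi in zip(c, xs))
    norm_Ltf = abs(f_xt) * np.sqrt(k(x_t, x_t))
    assert norm_Ltf <= kappa ** 2 * norm_f + 1e-9
```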
Proof of Theorem 2.1. From (2) we have the following recursion. Summing over t iterations, and arguing similarly for the second sum, gives (6) and (7). Combining (6), (7) and Lemma 3.2 yields a bound for each t. For I_3, assume i < j; then (8) gives an estimate that holds for a sufficiently large C.
In general, we assume the estimate holds with exponent s ≤ (1−θ)/(2θ−1). By (9) we then obtain an improved estimate, and by repeating this process until the case j = (1−θ)/(2θ−1), the theorem follows.

Proof of Theorem 2.2. From (5), taking the squared norm and then expectations, the claimed bound holds.