Quantitative convergence analysis of kernel based large-margin unified machines

High-dimensional binary classification has been intensively studied in the machine learning community over the last few decades. The support vector machine (SVM), one of the most popular classifiers, depends on only a portion of the training samples, called support vectors, which leads to suboptimal performance in the high dimension, low sample size (HDLSS) setting. Large-margin unified machines (LUMs) are a family of margin-based classifiers proposed to solve the so-called "data piling" problem inherent in the SVM under HDLSS settings. In this paper we study the binary classification algorithms associated with LUM loss functions in the framework of reproducing kernel Hilbert spaces. A quantitative convergence analysis is carried out for these algorithms by means of a novel application of projection operators to overcome the technical difficulty. The rates are derived explicitly under a priori conditions on the approximation ability and capacity of the reproducing kernel Hilbert space.

1. Introduction. In this paper we consider large-margin unified machines (LUMs) for binary classification problems and investigate the consistency of kernel based LUMs within the framework of learning theory.
Let the input space $X$ be a compact domain of $\mathbb{R}^d$ and the output space $Y = \{-1, 1\}$ represent the two classes. We assume that a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ is generated in an i.i.d. fashion by a probability measure $P$ on $Z := X \times Y$. The learning target in binary classification is to find a classifier $\mathcal{C}: X \to Y$ based on the sample data such that for a new observation $(x, y)$ we have $\mathcal{C}(x) = y$ with high probability. We define the misclassification error
$$\mathcal{R}(\mathcal{C}) = \mathrm{Prob}\{\mathcal{C}(x) \neq y\} = \int_X P(y \neq \mathcal{C}(x) \,|\, x)\, dP_X,$$
which is used to measure the prediction power of a classifier $\mathcal{C}$. Here $P_X$ is the marginal distribution of $P$ on $X$ and $P(y|x)$ is the conditional distribution at $x \in X$. The classifier minimizing the misclassification error is called the Bayes rule $f_c$, defined by $f_c(x) = 1$ if $P(y=1|x) \geq P(y=-1|x)$ and $f_c(x) = -1$ otherwise.
The classifiers considered here are induced by real-valued functions $f: X \to \mathbb{R}$ as $\mathcal{C}_f = \mathrm{sgn}(f)$, defined by $\mathrm{sgn}(f)(x) = 1$ if $f(x) \geq 0$ and $\mathrm{sgn}(f)(x) = -1$ otherwise. The real-valued functions are generated by different classification algorithms. By the definition of the classification rule, it is clear that correct classification occurs if and only if $yf(x) > 0$. The quantity $yf(x)$ is referred to as the functional margin, and it plays an essential role in large-margin classification algorithms.
Among various margin-based methods, the support vector machine (SVM) [4,6] is the most well known. SVM falls into the group of so-called hard classification methods since it tends to estimate the classification boundary directly. Boosting [10,11] is another typical hard classification method. Differing from hard classification, a soft classification rule aims at estimating the class conditional probabilities explicitly and then predicting the class with the largest estimated probability. Fisher linear discriminant analysis and logistic regression are two typical soft classification methods [12]. These two kinds of methods are based on different philosophies and each has its own merits. In a recent work [14], Liu and his coauthors proposed a unified framework of large-margin classifiers, namely large-margin unified machines (LUMs), which offers a unique transition from hard to soft classifiers. In addition, it was pointed out in [15] that the SVM may suffer from the "data piling" phenomenon for high dimension, low sample size (HDLSS) data; that is, a large portion of the data points are support vectors, and they pile up on top of each other when projected onto the normal vector of the separating hyperplane. Data piling is usually not desirable for a classifier since it may reduce generalizability and lead to suboptimal performance in the HDLSS setting. See [15] for a real data example and more details. To solve the data piling problem, a large-margin classifier called distance-weighted discrimination (DWD) was proposed therein. Note that DWD is a special case of LUMs. The LUM loss functions are defined as follows.

Definition 1.1. For given $p \geq 0$ and $q > 0$, the LUM loss function is defined as
$$V(t) = \begin{cases} 1 - t, & \text{if } t < \frac{p}{1+p},\\[4pt] \frac{1}{1+p}\left(\frac{q}{(1+p)t - p + q}\right)^q, & \text{if } t \geq \frac{p}{1+p}. \end{cases} \tag{1.1}$$
Here $p$ and $q$ play different roles. The parameter $p$ controls the connecting point between the two pieces of the loss function as well as the shape of the right piece. The parameter $q$ determines the decay speed of the right piece.
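For concreteness, here is a small numerical sketch of the LUM loss in Definition 1.1. The function name `lum_loss` is ours, and the exact two-piece formula is our reconstruction of the form given in [14], so treat it as an assumption rather than a quotation:

```python
import numpy as np

def lum_loss(t, p, q):
    """LUM loss V(t): linear left piece, polynomially decaying right piece.

    V(t) = 1 - t                                  for t <  p/(1+p),
    V(t) = (1/(1+p)) * (q/((1+p)t - p + q))**q    for t >= p/(1+p).
    (Reconstructed from [14]; the constants are our assumption.)
    """
    t = np.asarray(t, dtype=float)
    cut = p / (1.0 + p)
    # clip the denominator so the unused branch of np.where stays finite
    base = np.maximum((1.0 + p) * t - p + q, 1e-12)
    return np.where(t < cut, 1.0 - t, (1.0 / (1.0 + p)) * (q / base) ** q)
```

Both pieces meet with value $\frac{1}{1+p}$ at the connecting point $t = \frac{p}{1+p}$, so the loss is continuous there.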
The LUM loss defined in (1.1) represents a large family of loss functions, and many commonly used loss functions take the LUM form. For example, if $p \to \infty$ and $q > 0$, the LUM loss reduces to the hinge loss $V^{(h)}(t) = (1-t)_+$ widely used in the support vector machine (SVM). If $p = 1$ and $q = 1$, the LUM loss reduces to the DWD loss proposed in [24],
$$V^{(d)}(t) = \begin{cases} 1 - t, & \text{if } t < \frac{1}{2},\\[4pt] \frac{1}{4t}, & \text{if } t \geq \frac{1}{2}. \end{cases}$$
If $p = q > 0$, then
$$V(t) = \begin{cases} 1 - t, & \text{if } t < \frac{q}{1+q},\\[4pt] \frac{q^q}{(1+q)^{q+1}\, t^q}, & \text{if } t \geq \frac{q}{1+q}, \end{cases}$$
is the so-called generalized DWD loss defined in [24]. If $q \to \infty$ and $p \geq 0$, the LUM loss reduces to
$$V^{(he)}(t) = \begin{cases} 1 - t, & \text{if } t < \frac{p}{1+p},\\[4pt] \frac{1}{1+p}\, e^{\,p - (1+p)t}, & \text{if } t \geq \frac{p}{1+p}, \end{cases}$$
which is a hybrid of SVM and AdaBoost [11,14]. In particular, when $q \to \infty$ and $p = 0$, $V^{(he)}$ is a combination of the hinge loss and the exponential loss.

We say that $K: X \times X \to \mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite in the sense that the matrix $(K(x_i, x_j))_{i,j=1}^l$ is positive semidefinite for any $\{x_1, \cdots, x_l\} \subset X$. The reproducing kernel Hilbert space (RKHS) (see [1]) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x, \cdot): x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ given by $\langle K_x, K_y \rangle_K = K(x, y)$. Denote $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$. The RKHS has the reproducing property
$$f(x) = \langle f, K_x \rangle_K, \qquad \forall f \in \mathcal{H}_K,\ x \in X. \tag{1.2}$$
With the LUM loss $V$ and an RKHS $\mathcal{H}_K$, the kernel based LUM can be formulated as the following regularization scheme [22]:
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m V\big(y_i f(x_i)\big) + \lambda \|f\|_K^2 \right\}, \tag{1.3}$$
where $\lambda > 0$ is a regularization parameter balancing data fidelity and model complexity.
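By the representer theorem, the minimizer of scheme (1.3) lies in the span of $\{K_{x_i}\}_{i=1}^m$, so the optimization can be carried out over coefficient vectors. The following is a minimal gradient-descent sketch on toy data; the kernel choice, step size, toy sample, and all helper names (`gauss_kernel`, `dV`, `eta_step`) are our assumptions for illustration, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D sample: label is the sign of x, with a little label noise
m = 80
X = rng.uniform(-1.0, 1.0, size=(m, 1))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(m) > 0, 1.0, -1.0)

def gauss_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

p, q, lam = 1.0, 1.0, 1e-3           # p = q = 1 gives the DWD member of the family
cut = p / (1.0 + p)

def dV(t):
    # derivative of the (reconstructed) LUM loss; the loss is C^1 for finite p
    base = np.maximum((1.0 + p) * t - p + q, 1e-12)
    return np.where(t < cut, -1.0, -(q ** (q + 1)) * base ** (-(q + 1)))

# representer theorem: search over f = sum_i c_i K(x_i, .), so f(X) = K @ c
K = gauss_kernel(X, X)
c = np.zeros(m)
eta_step = 0.005                      # hand-tuned step size for this toy problem
for _ in range(5000):
    f = K @ c
    grad = K @ (y * dV(y * f)) / m + 2.0 * lam * (K @ c)
    c -= eta_step * grad

train_err = float(np.mean(np.sign(K @ c) != y))
```

The penalty gradient $2\lambda K c$ comes from $\lambda \|f\|_K^2 = \lambda\, c^\top K c$; in practice one would use a convex solver rather than plain gradient descent.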
In [14], Fisher consistency of the LUM loss functions was established. Later, [24] formulated a kernel DWD approach in a reproducing kernel Hilbert space and further established the Bayes risk consistency of the kernel based DWD. To the best of our knowledge, there is no quantitative convergence analysis for kernel based LUMs, not even for the kernel based DWD, except for the classical support vector machine associated with the hinge loss, which has already been well studied in a large literature; see [37,5,30,34,36,18,20] and the references therein. The purpose of this paper is to provide a quantitative convergence analysis for kernel based LUMs, i.e., to estimate the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c)$ as $m \to \infty$.

2. Key properties and main results.
2.1. Properties of LUM loss. In this section we discuss some properties of the LUM loss. From the definition of the LUM loss, it is easy to verify the following.
Lemma 2.1. Let $V$ be the LUM loss function defined in (1.1). Then $V(\cdot)$ is a convex function, and it is differentiable everywhere for $0 \leq p < \infty$. Moreover, $V(\cdot)$ has its smallest zero at $1$ when $p \to \infty$, while for $0 \leq p < \infty$ the loss function $V(\cdot)$ has no zero.
The following decay property of the LUM loss is required in our error analysis; it is proved in the Appendix.

Lemma 2.2. Let $V$ be the LUM loss with $0 \leq p < \infty$ and $0 < q < \infty$. Then for any $t > 0$ with $t \geq \frac{p}{1+p}$,
$$V(t) \leq \frac{1}{1+p}\left(\frac{\max\{p, q\}}{(1+p)\,t}\right)^q.$$
Denote the generalization error associated with the LUM loss $V$ by
$$\mathcal{E}(f) = \int_Z V\big(yf(x)\big)\, dP.$$
Let $\eta(x) = P(y=1|x)$. It was shown in [14] that, for $0 < q < \infty$ and $0 \leq p < \infty$, the minimizer $f_P$ of $\mathcal{E}(f)$ over all measurable functions is given by
$$f_P(x) = \begin{cases} \dfrac{1}{1+p}\left[p + q\left(\Big(\dfrac{\eta(x)}{1-\eta(x)}\Big)^{\frac{1}{q+1}} - 1\right)\right], & \text{if } \eta(x) \geq \frac{1}{2},\\[8pt] -\dfrac{1}{1+p}\left[p + q\left(\Big(\dfrac{1-\eta(x)}{\eta(x)}\Big)^{\frac{1}{q+1}} - 1\right)\right], & \text{if } \eta(x) < \frac{1}{2}. \end{cases}$$
For $p \to \infty$, the LUM loss reduces to the hinge loss of the SVM, whose minimizer is $f_P(x) = \mathrm{sgn}\big(\eta(x) - \frac{1}{2}\big)$. For $q \to \infty$, the LUM loss reduces to the hybrid of the hinge loss and the exponential loss, with minimizer
$$f_P(x) = \begin{cases} \dfrac{1}{1+p}\left[p + \log\dfrac{\eta(x)}{1-\eta(x)}\right], & \text{if } \eta(x) \geq \frac{1}{2},\\[8pt] -\dfrac{1}{1+p}\left[p + \log\dfrac{1-\eta(x)}{\eta(x)}\right], & \text{if } \eta(x) < \frac{1}{2}. \end{cases}$$
The excess misclassification error $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c)$ of the classifier $\mathrm{sgn}(f)$ can be bounded by means of the excess generalization error $\mathcal{E}(f) - \mathcal{E}(f_P)$ according to comparison theorems. A comparison theorem was first proved in [37] for the hinge loss $V^{(h)}(t) = (1-t)_+$. For general convex losses with a zero, comparison theorems have been studied in [3,5,20,29,31].
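The explicit form of $f_P$ can be checked numerically for Fisher consistency, i.e., that $\mathrm{sgn}(f_P)$ coincides with the Bayes rule. The sketch below solves nothing from the paper directly; the function `f_star` encodes our reconstruction of the minimizer (obtained from the first-order condition of the conditional risk), so its exact constants should be read as an assumption:

```python
import numpy as np

def f_star(eta, p, q):
    """Candidate population minimizer of the LUM risk at eta = P(y=1|x).

    Reconstructed by solving d/df [eta*V(f) + (1-eta)*V(-f)] = 0;
    treat the constants as an assumption, not a quotation from [14].
    """
    eta = np.asarray(eta, dtype=float)
    pos = (1.0 / (1.0 + p)) * (p + q * ((eta / (1.0 - eta)) ** (1.0 / (q + 1.0)) - 1.0))
    neg = -(1.0 / (1.0 + p)) * (p + q * (((1.0 - eta) / eta) ** (1.0 / (q + 1.0)) - 1.0))
    return np.where(eta >= 0.5, pos, neg)

# Fisher consistency check: the sign of f_star matches the Bayes rule
etas = np.linspace(0.05, 0.95, 91)
bayes = np.where(etas >= 0.5, 1.0, -1.0)
assert np.all(np.sign(f_star(etas, p=1.0, q=1.0)) == bayes)
```

For large $p$ the values of `f_star` flatten toward $\pm 1$, consistent with the hinge-loss minimizer $\mathrm{sgn}(\eta - \frac{1}{2})$.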
Recently, [9] gave a complete study of the comparison theorems for the LUM loss functions, stated in the following lemma.
Lemma 2.4. (i) Let $V$ be the LUM loss with $p > 0$. For any probability measure $P$, any measurable function $f: X \to \mathbb{R}$, and some constant $C_p > 0$, it holds that
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq C_p\left(\mathcal{E}(f) - \mathcal{E}(f_P)\right). \tag{2.6}$$
(ii) Let $V$ be the LUM loss with $p = 0$. For any probability measure $P$, any measurable function $f: X \to \mathbb{R}$, and some constant $C_q > 0$, it holds that
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq C_q\left(\mathcal{E}(f) - \mathcal{E}(f_P)\right)^{1/2}. \tag{2.7}$$
Obviously, the bound (2.7) for the case $p = 0$ is worse than (2.6) for the case $p > 0$. It may be improved by imposing the following Tsybakov noise condition [21] on the probability measure $P$.
Definition 2.5. Let $0 \leq \tau \leq \infty$. We say that the probability measure $P$ satisfies a Tsybakov noise condition with exponent $\tau$ if there exists a constant $C_\tau > 0$ such that
$$P_X\big(\{x \in X: |2\eta(x) - 1| \leq t\}\big) \leq C_\tau\, t^\tau, \qquad \forall t > 0. \tag{2.8}$$
Lemma 2.6. Let $V$ be the LUM loss with $p = 0$. Under assumption (2.8) with $0 \leq \tau \leq \infty$, the following comparison theorem holds with some constant $C_{q,\tau} > 0$:
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq C_{q,\tau}\left(\mathcal{E}(f) - \mathcal{E}(f_P)\right)^{\frac{\tau+1}{\tau+2}}. \tag{2.9}$$

2.2. Projection operator and error decomposition. We note that when $0 \leq p < \infty$, the LUM loss $V$ has no zero on $\mathbb{R}$, which leads to an unbounded target function $f_P$ and causes some difficulties in our analysis. In particular, the projection technique and the concentration inequality used for the SVM cannot be applied directly here. To overcome this difficulty, we introduce a different projection operator $\pi_M$, defined below.

Definition 2.7. For any $M > 0$, the projection operator $\pi_M$ on the space of functions on $X$ is defined by
$$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M,\\ f(x), & \text{if } -M \leq f(x) \leq M,\\ -M, & \text{if } f(x) < -M. \end{cases}$$
In [32], a similar projection operator was proposed to analyze binary classification with the logistic loss. Since the LUM loss has no zero and thus leads to an unbounded target function $f_P$, we let the projection level vary with the sample size, i.e., $M = M(m)$. The projection operator $\pi_M$ used in this paper differs from the original one introduced for classifying losses with a zero (see, e.g., [5,34,19,33] for details). Obviously $\mathrm{sgn}(\pi_M(f)) = \mathrm{sgn}(f)$. Therefore, we turn to estimating the excess misclassification error $\mathcal{R}\big(\mathrm{sgn}(\pi_M(f_{\mathbf{z}}))\big) - \mathcal{R}(f_c)$ with confidence as the sample size $m \to \infty$. This can be done by bounding the excess generalization error $\mathcal{E}(\pi_M(f_{\mathbf{z}})) - \mathcal{E}(f_P)$ because of the comparison theorems stated in (2.6) and (2.9). Define the empirical error associated with the loss $V$ as
$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m}\sum_{i=1}^m V\big(y_i f(x_i)\big),$$
and let $f_\lambda$ denote the minimizer of the sample-free version of (1.3), i.e., $f_\lambda = \arg\min_{f \in \mathcal{H}_K}\big\{\mathcal{E}(f) + \lambda\|f\|_K^2\big\}$. The regularization error is defined by
$$\mathcal{D}(\lambda) = \min_{f \in \mathcal{H}_K}\big\{\mathcal{E}(f) - \mathcal{E}(f_P) + \lambda\|f\|_K^2\big\}. \tag{2.10}$$
We then conduct an error decomposition of the excess generalization error $\mathcal{E}(\pi_M(f_{\mathbf{z}})) - \mathcal{E}(f_P)$, in terms of the sample errors $\mathcal{S}_{\mathbf{z}}(f_\lambda)$ and $\mathcal{S}_{\mathbf{z}}(\pi_M(f_{\mathbf{z}}))$, the regularization error $\mathcal{D}(\lambda)$, and the term $V(M)$, in the following lemma, which is a direct corollary of Propositions 5.4 and 5.6 in [26]. We include the proof in the Appendix for completeness.
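Definition 2.7 is simply a truncation of function values to $[-M, M]$. A minimal sketch (the helper name `project` is ours), which also makes visible the fact used above that the projection never changes the induced classifier:

```python
import numpy as np

def project(f_vals, M):
    """pi_M from Definition 2.7: truncate function values to [-M, M]."""
    return np.clip(f_vals, -M, M)

f_vals = np.array([-3.0, -0.5, 0.2, 5.0])
clipped = project(f_vals, 1.0)
assert np.all(clipped == np.array([-1.0, -0.5, 0.2, 1.0]))
# truncation never changes the induced classifier: sgn(pi_M(f)) = sgn(f)
assert np.all(np.sign(clipped) == np.sign(f_vals))
```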
In Lemma 2.8, we call $\mathcal{S}_{\mathbf{z}}(f)$ the sample error and $\mathcal{D}(\lambda)$ the regularization error; the latter is independent of the sample. The main difference between Lemma 2.8 and the error decompositions in the literature [5,34,19,33] is the appearance of $V(M)$. The reason is that $V(t)$, having no zero, is a strictly decreasing positive function.

2.3. Main results.
To state our main results, we need to measure the capacity of the hypothesis space by covering numbers and to bound the regularization error.
Definition 2.9 (Uniform covering number). For a subset S of C(X) and u > 0, the covering number N (S, u) is the minimal integer l ∈ N such that there exist l disks with radius u covering S.
The covering numbers of unit balls of classical function spaces have been well studied in the literature (see, e.g., [8,2,25,39,40]). Here we need the covering numbers of balls of the RKHS $\mathcal{H}_K$; denote $B_R = \{f \in \mathcal{H}_K: \|f\|_{\mathcal{H}_K} \leq R\}$. Estimating uniform convergence in terms of covering numbers is well developed in learning theory, e.g., [7,5,27].
To derive explicit convergence rates for kernel based LUMs, we impose some assumptions on the covering number and the regularization error.

Assumption 1. Assume that for some $s > 0$ and some constant $C_s > 0$, the covering numbers of the unit ball $B_1$ of $\mathcal{H}_K$ satisfy
$$\log \mathcal{N}(B_1, u) \leq C_s \left(\frac{1}{u}\right)^s, \qquad \forall u > 0. \tag{2.11}$$
Assumption 2. Assume that for some constant $C_r > 0$, the regularization error satisfies
$$\mathcal{D}(\lambda) \leq C_r \lambda^r, \qquad 0 < r \leq 1. \tag{2.12}$$
Let us illustrate our main result with the following special case, which will be proved in Section 3.

Theorem 2.10. 1. If $p > 0$, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds the bound (2.13), where $\widetilde{C}_1$ is a constant independent of $m$ or $\delta$.

2. If $p = 0$, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds the bound (2.14), where $\widetilde{C}_2$ is a constant independent of $m$ or $\delta$.
Note that (2.12) holds with r = 1 if the target function f P ∈ H K . See [17,16,35] and references therein for more discussions on the regularization error D(λ).
The next theorem gives a convergence result under general assumptions.

Theorem 2.11. 1. If $p > 0$, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds the bound (2.16), where $\widetilde{C}_1$ is a constant independent of $m$ or $\delta$ and the power index $\vartheta$ is given in terms of $r$, $s$, $q$, $\alpha$ and $\eta$ by (2.17).

2. If $p = 0$, for any probability measure $P$ satisfying (2.8) with $0 < \tau \leq \infty$ and any $0 < \delta < 1$, with confidence $1 - \delta$, there holds the bound (2.18), where $\widetilde{C}_2$ is a constant independent of $m$ or $\delta$ and the power index $\vartheta$ is given in (2.17).
Note that the error bound obtained in Theorem 5.7 of [26] also applies to loss functions without a zero, under assumptions on the variance-expectation bound and $\|f_P\|_\infty < \infty$. In this paper we consider an unbounded target function $f_P$, and it would be of great interest to verify whether the variance-expectation bound holds for LUM loss functions in order to improve the learning rates.
The index $\vartheta$ can be viewed as a function of the parameters $r, s, q, \alpha, \eta$. The restrictions $0 < \alpha < \frac{4q}{3s(q+1)}$ on $\alpha$ and (2.15) on $\eta$ ensure that $\vartheta$ is positive, which validates the learning rate in Theorem 2.11. Assumption 1 measures the regularity of the kernel $K$ when $X$ is a subset of $\mathbb{R}^n$. In particular, $s$ can be arbitrarily small when $K$ is smooth enough; in this case, the power index $\vartheta$ in (2.17) can be arbitrarily close to $\min\left\{\alpha r,\ \frac{\alpha(r-1)+1}{2},\ \frac{q}{2(q+1)}\right\}$. The convergence results can be extended to the case $q = \infty$ by noting the exponential decay of the right tail of $V^{(he)}(t)$, but this is beyond the scope of this paper.

3. Error analysis.
3.1. Sample error. We are now in a position to estimate the sample errors $\mathcal{S}_{\mathbf{z}}(f_\lambda)$ and $\mathcal{S}_{\mathbf{z}}(\pi_M(f_{\mathbf{z}}))$ defined in Lemma 2.8 by means of the following Hoeffding inequality and covering numbers.
Lemma 3.1. Let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi) = \mu$, satisfying $|\xi - E(\xi)| \leq B$ almost surely. Then for all $\epsilon > 0$,
$$\mathrm{Prob}_{\mathbf{z} \in Z^m}\left\{\left|\frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu\right| \geq \epsilon\right\} \leq 2\exp\left(-\frac{m\epsilon^2}{2B^2}\right). \tag{3.1}$$
For $R \geq 1$, let $\mathcal{W}(R)$ be the subset of $Z^m$ defined by
$$\mathcal{W}(R) = \left\{\mathbf{z} \in Z^m: \|f_{\mathbf{z}}\|_{\mathcal{H}_K} \leq R\right\}.$$
It follows from (2.10) and (2.12) that
$$\|f_\lambda\|_{\mathcal{H}_K} \leq \sqrt{\mathcal{D}(\lambda)/\lambda} \leq \sqrt{C_r}\,\lambda^{(r-1)/2}. \tag{3.2}$$
Proposition 1. Let $V$ be the LUM loss with $0 \leq p < \infty$ and $0 < q < \infty$. Suppose Assumptions 1 and 2 hold for $s > 0$ and $0 < r \leq 1$. Let $R \geq 1$, $M \geq 1$, and $0 < \delta < 1$. Then there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that for any $\mathbf{z} \in \mathcal{W}(R)\setminus V_R$,

Proof.
Step 1. Let us first estimate the quantity $\mathcal{S}_{\mathbf{z}}(f_\lambda)$, which can be expressed through the single random variable $\xi(z) = V(yf_\lambda(x)) - V(y\,\pi_M(f_P)(x))$ on $(Z, P)$. Since $V$ is positive and decreasing, we have $|\xi - E(\xi)| \leq V(-\|f_\lambda\|_{L^\infty(X)}) + V(-M)$. Applying Lemma 3.1 to this random variable and solving the resulting equation for $\epsilon$, we find that there exists a subset $Z_{1,\delta}$ of $Z^m$ with measure at most $\delta/2$ outside of which the deviation is controlled. Hence, for any $0 < \delta < 1$, with confidence $1 - \frac{\delta}{2}$, the term $\mathcal{S}_{\mathbf{z}}(f_\lambda)$ admits the stated bound.

Step 2. Next, we estimate the term $-\mathcal{S}_{\mathbf{z}}(\pi_M(f_{\mathbf{z}}))$, which involves the function $f_{\mathbf{z}}$. Here $f_{\mathbf{z}}$ runs over a set of functions since $\mathbf{z}$ is itself a random sample. To estimate it, we use a standard argument (see, e.g., [7]) with the Hoeffding inequality and covering numbers.
Let $J = \mathcal{N}\big(B_R, \frac{\epsilon}{4}\big)$. Then there exists a set of functions $\{f_j\}_{j=1}^J \subset B_R$ such that $B_R$ is covered by the balls $B^{(j)}$ centered at $f_j$ with radius $\frac{\epsilon}{4}$.
Let $j \in \{1, \cdots, J\}$. The random variable $z \mapsto V(y\,\pi_M(f_j)(x)) - V(y\,\pi_M(f_P)(x))$ is bounded, so the one-sided Hoeffding inequality applies to each $f_j$; taking a union over $j$, we conclude a probability bound for the supremum of the corresponding deviation over $B_R$. Denote by $\epsilon^*(m, R, M, \delta/2)$ the positive solution of the associated equation. By Lemma 7.2 in [7], the positive solution $\epsilon^*(m, R, M, \delta/2)$ of this equation can be bounded explicitly. Thus, there exists a second subset $Z_{2,\delta}$ of $Z^m$ with measure at most $\frac{\delta}{2}$ outside of which the corresponding bound holds.

Step 3. Combining the above two steps, we estimate the total error $\mathcal{E}(\pi_M(f_{\mathbf{z}})) - \mathcal{E}(f_P) + \lambda\|f_{\mathbf{z}}\|_{\mathcal{H}_K}^2$. Since the measure of the set $V_R := Z_{1,\delta} \cup Z_{2,\delta}$ is at most $\delta$, for $\mathbf{z} \in \mathcal{W}(R)\setminus V_R$ we obtain (3.4). Plugging (2.12), (3.2) and Lemma 2.2 into (3.4) yields the claimed estimate. Here we have used the reproducing property (1.2) in $\mathcal{H}_K$, which yields $\|f\|_{L^\infty(X)} \leq \kappa\,\|f\|_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. This completes the proof of the proposition.
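Both steps of the proof rest on the Hoeffding-type concentration of Lemma 3.1. As a quick numerical illustration (our own sanity check, not part of the argument), a Monte Carlo experiment with a bounded variable confirms that the deviation probability sits below the two-sided bound $2\exp(-m\epsilon^2/(2B^2))$:

```python
import numpy as np

rng = np.random.default_rng(1)

# xi uniform on [-1, 1]: mean mu = 0 and |xi - mu| <= B = 1
m, trials, eps = 200, 5000, 0.2
sample_means = rng.uniform(-1.0, 1.0, size=(trials, m)).mean(axis=1)
empirical = float(np.mean(np.abs(sample_means) >= eps))

# two-sided Hoeffding bound with B = 1: 2 exp(-m eps^2 / (2 B^2))
bound = 2.0 * np.exp(-m * eps ** 2 / 2.0)
```

With these parameters the bound is about $0.037$, while the empirical frequency of large deviations is essentially zero, as expected for a sub-Gaussian average.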

3.2. Strong bound by iteration.
In Proposition 1 we need some $R \geq 1$ with $\mathbf{z} \in \mathcal{W}(R)$. We may choose $R = \lambda^{-1/2}$ according to the bound
$$\lambda\|f_{\mathbf{z}}\|_{\mathcal{H}_K}^2 \leq \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) + \lambda\|f_{\mathbf{z}}\|_{\mathcal{H}_K}^2 \leq \mathcal{E}_{\mathbf{z}}(0) = V(0) \leq 1,$$
which comes immediately from taking $f = 0$ in (1.3). We observe that the bound in (3.2) is much better than this choice, which motivates us to seek a similarly tight bound for $f_{\mathbf{z}}$. We will apply Proposition 1 iteratively to achieve this, which in turn improves the learning rates. This iteration technique has been used in [28,20].
Denote $\Delta = \frac{s}{4+2s} < \frac{1}{2}$. The definition of the sequence $\{R^{(j)}\}_{j=0}^J$ yields (3.8). Let us bound the two terms on its right-hand side.
Proof of Theorem 2.11. Take $R$ to be the right-hand side of (3.6). By Lemma 3.2, there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that $Z^m \setminus V_R \subseteq \mathcal{W}(R)$.
Applying Proposition 1 with this $R$, we know that there exists another subset $V_R'$ of $Z^m$ with measure at most $\delta$ such that the corresponding bound holds for any $\mathbf{z} \in \mathcal{W}(R)\setminus V_R'$. Since the set $V_R \cup V_R'$ has measure at most $2\delta$, after rescaling $2\delta$ to $\delta$ and defining the constant $C_3$ accordingly, we see that the stated bound holds with confidence $1 - \delta$, with the power index $\vartheta$ as given. With the choice $\beta = \frac{1}{2(q+1)} < \frac{1}{2}$, the term involving $\eta$ is less than $\frac{q}{s(q+1)}$. By the restriction $0 < \alpha < \frac{4q}{3s(q+1)}$ on $\alpha$, we find that $\alpha(1-r) < \frac{q}{s(q+1)}$. Moreover, restriction (2.15) on $\eta$ implies that $\frac{\alpha(2+s)+2\beta-1}{4+s} + \eta < \frac{q}{s(q+1)}$. Therefore, condition (3.9) is satisfied. These restrictions and the expression for $\vartheta$ tell us that the power index in the error bound is exactly given by formula (2.17). Combining with (2.6) and (2.7), the proof of Theorem 2.11 is complete with $\widetilde{C}_1 = C_p C_3$ and $\widetilde{C}_2 = C_{q,\tau} C_3$, respectively. We are now in a position to prove Theorem 2.10.
4. Discussion. This paper establishes a quantitative convergence analysis for a class of kernel based large-margin unified machines, which were proposed to solve the so-called "data piling" problem in the high dimension, low sample size setting. We derived explicit learning rates for this kind of learning scheme under mild conditions on the regularization error and on the capacity of the RKHS measured by uniform covering numbers. Note that it may be possible to improve the learning rates by considering special Mercer kernels such as the Gaussian kernel [34,31,32]. It will also be interesting to study kernel based LUMs in the settings of multicategory learning [13] and pairwise learning [38,23] within the framework of statistical learning theory. These are left for future work.
Appendix. Proof of Lemma 2.2. For $t \geq \frac{p}{1+p}$, write $(1+p)t - p + q = (1+p)\big(t - \frac{p-q}{1+p}\big)$. If $p \leq q$, we observe that $t - \frac{p-q}{1+p} \geq t$. It follows that
$$V(t) = \frac{1}{1+p}\left(\frac{q}{(1+p)t - p + q}\right)^q \leq \frac{1}{1+p}\left(\frac{q}{(1+p)\,t}\right)^q.$$
If $p > q$, since $t \geq \frac{p}{1+p}$, we observe that $t - \frac{p-q}{1+p} = t - \frac{p}{1+p}\cdot\frac{p-q}{p} \geq t - \frac{p-q}{p}\,t = \frac{q}{p}\,t$. It yields that
$$V(t) \leq \frac{1}{1+p}\left(\frac{p}{(1+p)\,t}\right)^q.$$
Proof of Lemma 2.8. Our conclusion follows by subtracting and adding $\mathcal{E}(\pi_M(f_P))$ and $\mathcal{E}_{\mathbf{z}}(\pi_M(f_P))$ in the first and third terms of (4.1).
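The two cases above combine into an $O(t^{-q})$ envelope for the right piece of the loss. A numerical check of this reconstructed bound (the helper names `lum_loss` and `decay_envelope` are ours, as is the two-piece formula assumed for $V$):

```python
import numpy as np

def lum_loss(t, p, q):
    # LUM loss as reconstructed from Definition 1.1 (our assumption)
    cut = p / (1.0 + p)
    base = np.maximum((1.0 + p) * t - p + q, 1e-12)
    return np.where(t < cut, 1.0 - t, (1.0 / (1.0 + p)) * (q / base) ** q)

def decay_envelope(t, p, q):
    # O(t^{-q}) bound combining the two cases p <= q and p > q
    return (1.0 / (1.0 + p)) * (max(p, q) / ((1.0 + p) * t)) ** q

# the envelope dominates the loss on [p/(1+p), infinity) in all three regimes
for p, q in [(0.5, 2.0), (3.0, 1.0), (1.0, 1.0)]:
    ts = np.linspace(p / (1.0 + p) + 1e-6, 50.0, 1000)
    assert np.all(lum_loss(ts, p, q) <= decay_envelope(ts, p, q) + 1e-12)
```

For $p = q$ the envelope is attained with equality, so the exponent $q$ in Lemma 2.2 cannot be improved within this family.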