Quantitative Robustness of Localized Support Vector Machines

The huge amount of available data nowadays is a challenge for kernel-based machine learning algorithms like SVMs with respect to runtime and storage capacities. Local approaches might help to relieve these issues and to improve statistical accuracy. It has already been shown that these local approaches are consistent and robust in a basic sense. This article refines the analysis of robustness properties towards the so-called influence function which expresses the differentiability of the learning method: We show that there is a differentiable dependency of our locally learned predictor on the underlying distribution. The assumptions of the proven theorems can be verified without knowing anything about this distribution. This makes the results interesting also from an applied point of view.


Introduction
This paper analyzes a special robustness property of localized kernel-based, non-parametric statistical machine learning methods, in particular of support vector machines (SVMs) (Boser, Guyon & Vapnik, 1992; Cortes & Vapnik, 1995), and methods close to them. There are many general introductions to these methods from the viewpoints of computer science and statistics; textbook treatments include Cristianini & Shawe-Taylor (2000), Schölkopf & Smola (2001), Cucker & Zhou (2007), and Steinwart & Christmann (2008). These methods have become popular in many fields of science, see for example Ma & Guo (2014). The analysis provided by this paper refers to supervised learning, i. e. to classification or regression problems. Beyond this, support vector machines are also a suitable method for unsupervised learning (e. g. novelty detection).
The paper can be seen as a sequel to Dumpert & Christmann (2018), where universal consistency and robustness with respect to the maxbias of localized support vector machines have already been shown. This paper refines that robustness analysis. It is organized as follows: Section 2.1 gives a short overview of support vector machines, and Section 2.2 briefly introduces the idea of local approaches. The results concerning the influence function of localized support vector machines are given in Section 3. Section 4 summarizes the paper.

Support Vector Machines
where (1) is called the empirical risk of a function $f$ with respect to a shifted loss function $L^*$ and an empirical measure $D_n = n^{-1} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ (where $\delta_{(x,y)}$ denotes the Dirac measure at a point $(x, y) \in \mathcal{X} \times \mathcal{Y}$) based on a sample $D_n = ((x_1, y_1), \ldots, (x_n, y_n))$ of i. i. d. realizations (with respect to a joint distribution $P$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}})$, where $\mathcal{B}_{\mathcal{X} \times \mathcal{Y}}$ denotes the Borel-σ-algebra on $\mathcal{X} \times \mathcal{Y}$) of random variables $X$ (input, with values in $\mathcal{X}$) and $Y$ (output, with values in $\mathcal{Y}$).
(2) is called the theoretical risk associated with (1). Minimizers of (1) are called empirical support vector machines and will be denoted by $f_{L^*, D_n, \lambda_n}$; minimizers of (2), i. e. theoretical SVMs, will be denoted by $f_{L^*, P, \lambda_n}$. A supervised loss function (or shorter: a loss function) has to measure the difference between observed and predicted values in an appropriate way and is defined as a measurable function $L : \mathcal{Y} \times \mathbb{R} \to [0, \infty[$. (For unsupervised learning a slightly different definition is needed.) In order to create the link to Dumpert & Christmann (2018) we are also interested in the so-called shifted version $L^*$ of a loss function $L$, defined by $L^* : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$, $L^*(y, t) := L(y, t) - L(y, 0)$, see also Appendix B of Dumpert & Christmann (2018). Within the next lines, we have to recall some definitions and results. A loss function
It is easy to show that if $L$ is a loss function which is convex, then $L^*$ is convex, and if $L$ is a loss function which is Lipschitz continuous, then $L^*$ is Lipschitz continuous with the same Lipschitz constant. Note that in all situations where the theoretical SVM with respect to an unshifted loss function $L$ ($f_{L, P, \lambda_n}$) exists, it holds true that $f_{L, P, \lambda_n} = f_{L^*, P, \lambda_n}$. It is always true that $f_{L, D_n, \lambda_n} = f_{L^*, D_n, \lambda_n}$. Hence, the (computational) algorithms and the resulting predictors are the same (as far as they exist) with or without shifting the loss function.
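The effect of shifting can be illustrated numerically. The following minimal sketch assumes the absolute distance loss $L(y, t) = |y - t|$, chosen here only for illustration because it is convex and Lipschitz continuous with constant 1; the helper names `absolute_loss` and `shift` are hypothetical. It checks that $|L^*(y, t)| \le |t|$ even for heavy-tailed outputs and that shifting leaves increments in $t$, and hence the Lipschitz constant, unchanged.

```python
import numpy as np

def absolute_loss(y, t):
    """Illustrative loss L(y, t) = |y - t| (convex, Lipschitz constant 1)."""
    return np.abs(y - t)

def shift(loss):
    """Return the shifted version L*(y, t) = L(y, t) - L(y, 0) of a loss."""
    def shifted(y, t):
        return loss(y, t) - loss(y, 0.0)
    return shifted

shifted_loss = shift(absolute_loss)

rng = np.random.default_rng(0)
y = rng.standard_cauchy(1000)        # heavy-tailed outputs (assumed test data)
t1 = rng.uniform(-2.0, 2.0, 1000)
t2 = rng.uniform(-2.0, 2.0, 1000)

# L(y, t) grows with |y|, but |L*(y, t)| <= |L|_1 * |t| = |t| for this loss.
shifted_is_bounded = bool(
    np.all(np.abs(shifted_loss(y, t1)) <= np.abs(t1) + 1e-8)
)

# The term L(y, 0) cancels in differences, so increments in t are unchanged.
same_increments = bool(np.allclose(
    shifted_loss(y, t1) - shifted_loss(y, t2),
    absolute_loss(y, t1) - absolute_loss(y, t2),
))
```

Since $L(y, 0)$ does not depend on $t$, the same cancellation explains why the minimizers with and without shifting coincide whenever both exist.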
The regularization parameter $\lambda$ usually depends on the sample size $n \in \mathbb{N}$ (in this case, we write $\lambda_n$), is positive for all $n \in \mathbb{N}$, and plays an important role within the next sections. The aim of support vector machines in supervised learning is to discover the influence of a (generally multivariate) input (or explanatory) variable $X$ on a univariate output (or response) variable $Y$. Our goal is to explore the functional relationship that describes the conditional distribution of $Y$ given $X$. $\mathcal{X}$, the input space, is generally assumed to be a separable metric space. For some results of this paper $\mathcal{X}$ additionally has to be complete. For the rest of the paper, the output space $\mathcal{Y}$ is assumed to be a closed subset of the real line $\mathbb{R}$. When we talk about a data set, a sample or observed data, we think (for $n \in \mathbb{N}$) about an $n$-tuple, but note that, although it is a tuple, we treat it like a set and use notations like ∈, ∩, ...; nevertheless we allow that the sample contains a data point twice or several times. $H$ denotes a reproducing kernel Hilbert space (RKHS). For the bijection between kernels and their RKHSs see Aronszajn (1950), Schölkopf & Smola (2001) and Berlinet & Thomas-Agnan (2001). A very important connection between the functions in an RKHS and its corresponding kernel is given by the following propositions (Steinwart & Christmann, 2008, Lemma 4.23, Lemma 4.28).
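Proposition 2.1 If and only if the reproducing kernel $k$ of an RKHS $H$ is bounded, every $f \in H$ is bounded and for all $f \in H$, $x \in \mathcal{X}$ there is the inequality $|f(x)| \le \|f\|_H \, \|k\|_\infty$.
Proposition 2.2 Let $k$ be a kernel with RKHS $H$. Then $k$ is bounded and $k(\cdot, x) : \mathcal{X} \to \mathbb{R}$ is continuous for all $x \in \mathcal{X}$ if and only if every $f \in H$ is bounded and continuous. Obviously: If $k(\cdot, \cdot)$ is continuous, then $k(\cdot, x) : \mathcal{X} \to \mathbb{R}$ is continuous for all $x \in \mathcal{X}$.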
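The link between the sup-norm and the RKHS norm for bounded kernels, i. e. the standard inequality $|f(x)| \le \|f\|_H \|k\|_\infty$ (Steinwart & Christmann, 2008, Lemma 4.23), can be checked on a small example. The sketch below is only an illustration under stated assumptions: a Gaussian kernel on the real line with the parametrization $\exp(-\gamma (x - x')^2)$, the convention $\|k\|_\infty = \sup_x \sqrt{k(x, x)}$, and randomly chosen centers and coefficients.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian kernel exp(-gamma * (x - x')^2) on the real line; k(x, x) = 1."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
centers = rng.uniform(-3.0, 3.0, 15)   # assumed expansion points x_j
alpha = rng.normal(size=15)            # assumed coefficients

# f = sum_j alpha_j k(., x_j) lies in H with ||f||_H^2 = alpha^T K alpha.
K = gaussian_kernel(centers, centers)
h_norm = np.sqrt(alpha @ K @ alpha)

# Evaluate f on a grid to approximate its sup-norm from below.
grid = np.linspace(-5.0, 5.0, 2001)
f_on_grid = gaussian_kernel(grid, centers) @ alpha
sup_norm_on_grid = np.max(np.abs(f_on_grid))

k_sup = 1.0  # ||k||_inf = sup_x sqrt(k(x, x)) = 1 for this kernel
```

On every grid point, `sup_norm_on_grid` stays below `h_norm * k_sup`, which is the Cauchy-Schwarz argument behind the inequality applied to $f(x) = \langle f, k(\cdot, x) \rangle_H$.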
SVMs are known to be universal (risk-)consistent, i. e.

Localized approaches and regionalization
A short overview of the idea of localized statistical learning is already given in Dumpert & Christmann (2018). We now take it up again. Two main aspects show the need for localized approaches. The first is the computational effort of kernel-based machine learning methods: the larger the sample, the more costly the computation of a solution. The second is statistical. Different areas of $\mathcal{X} \times \mathcal{Y}$ might place different demands on the statistical method: there might be regions that require simple functions serving as predictors while other regions might need more volatile functions. The success of machine learning approaches often depends heavily on finding optimal hyperparameters. These parameters often determine the complexity of the predictor. By learning one set of hyperparameters for the whole input space, we often have to average out the specifics of the local areas. Local learning allows one to use different hyperparameters and even different kernels in different regions. These regions have to fulfil some of the following assumptions.
(R1) A regionalization method divides the input space $\mathcal{X}$ into possibly overlapping regions, i. e. $\mathcal{X} = \bigcup_{b=1}^{B_n} \mathcal{X}_b$. $B_n$ is the number of regions, usually chosen by the regionalization method and therefore depending (at least) on a subsample drawn to do the regionalization. Note that $B := B_n$ is constant after the regionalization, so we have $\mathcal{X} = \bigcup_{b=1}^{B} \mathcal{X}_b$. Note that this is not the same as robust learning from bites (Christmann, Steinwart & Hubert, 2007).
(R2) For every $b \in \{1, \ldots, B\}$, $\mathcal{X}_b$ is a separable metric space (which is easy to fulfil as subsets of separable sets are separable and subsets of metric spaces are metric spaces (Dunford & Schwartz, 1958, I.6.4, I.6.12)), and, in addition, a complete measurable space, i. e., for all probability measures, Note that this notion of completeness refers to the measurability of null sets, see Ash & Doleans-Dade (2000, Definition 1.3.7).
(For an arbitrary set $M$, $|M|$ denotes the number of its elements.)
(R4) $\mathcal{X}_b$ is a complete metric space for every $b \in \{1, \ldots, B\}$, in the sense that every Cauchy sequence in $\mathcal{X}_b$ has a limit in $\mathcal{X}_b$. Note that this is easy to ensure by using the completion of the results of the regionalization method. (This is not a problem for the regionalization because the regions need not be disjoint.)
In a situation where the whole input space $\mathcal{X}$ is divided by a regionalization method into some regions $\mathcal{X}_1, \ldots, \mathcal{X}_B$, which need not be disjoint, we learn the SVMs separately, one SVM for each region. After that, we combine these local SVMs to a composed estimator or classifier, respectively. The influence of the local predictors may be controlled pointwise by measurable weight functions $w_b : \mathcal{X} \to [0, 1]$, $b \in \{1, \ldots, B\}$, which have to fulfil the following two conditions:
(W1) $\sum_{b=1}^{B} w_b(x) = 1$ for all $x \in \mathcal{X}$, and
(W2) $w_b(x) = 0$ for all $x \notin \mathcal{X}_b$ and for all $b \in \{1, \ldots, B\}$.
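Conditions (W1) and (W2) can be made concrete with two overlapping regions. The following sketch is purely illustrative: the regions $[0, 0.6]$ and $[0.4, 1]$, the linear blending on the overlap, and the stand-in local predictors `f1` and `f2` are all hypothetical, and the weighted sum used in `composed` is an assumed form of the composed predictor that follows the description of pointwise weighting above.

```python
import numpy as np

# Hypothetical regions X_1 = [0, 0.6] and X_2 = [0.4, 1]; they overlap on [0.4, 0.6].

def w1(x):
    """Weight of region 1: 1 left of the overlap, linear decay on it, 0 outside X_1."""
    x = np.asarray(x, dtype=float)
    ramp = (0.6 - x) / 0.2
    return np.where(x < 0.4, 1.0, np.where(x > 0.6, 0.0, ramp))

def w2(x):
    """Weight of region 2, chosen so that the weights sum to one (W1)."""
    return 1.0 - w1(x)

def f1(x):
    """Stand-in for the SVM learned on region 1 (hypothetical)."""
    return np.sin(2.0 * np.pi * np.asarray(x, dtype=float))

def f2(x):
    """Stand-in for the SVM learned on region 2 (hypothetical)."""
    return 0.5 * np.asarray(x, dtype=float)

def composed(x):
    """Assumed composition: pointwise weighted sum of the local predictors."""
    return w1(x) * f1(x) + w2(x) * f2(x)

x = np.linspace(0.0, 1.0, 101)
prediction = composed(x)
```

Outside the overlap exactly one weight is 1, so the composed predictor coincides with the local one there; inside the overlap it interpolates between both.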
We follow the notation in Dumpert & Christmann (2018) and define the composed predictors as follows: where
• $P$ is the unknown distribution of $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ and $D_n := n^{-1} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is the empirical measure based on a sample or data set $D_n := ((x_1, y_1), \ldots, (x_n, y_n))$ of $n$ i.i.d. realizations of $(X, Y)$.
• $P_b$ is the theoretical distribution on $\mathcal{X}_b \times \mathcal{Y}$, $D_{n,b}$ its empirical analogue. They are in fact probability distributions in all interesting situations, i. e. if $P(\mathcal{X}_b \times \mathcal{Y}) > 0$ or $D_n(\mathcal{X}_b \times \mathcal{Y}) > 0$, respectively, because they are built from $P$ and $D_n$ as follows:
• In an analogous way, the regional marginal distribution of $X$ is defined by P if $P_X(\mathcal{X}_b) > 0$ and 0 otherwise.
or, if we want to emphasize the number of data points, also
In the situation of a predictor composed of locally learned SVMs, this predictor is universally (risk-)consistent, too. We recall the relevant theorem from Dumpert & Christmann (2018).
in probability with respect to $P$.

Robustness in terms of the influence function
First, please note that there is already a robustness result in terms of the so-called maxbias shown in Dumpert & Christmann (2018). In this paper we use another notion of robustness, the so-called influence function according to Hampel (1968), which considers a statistical operator $S$ that assigns to every distribution $P$ on the Borel-σ-algebra $\mathcal{B}_M$ of a suitable set $M$ an element of a Banach space, i. e. in the situation at hand the predictor $f_{L^*, P, \lambda}$ (which, in the approach without regionalization, is even an element of a (reproducing kernel) Hilbert space).

Definition 3.1
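The influence function of $S$ at a point $z$ for a distribution $P$ is (if it exists) $\mathrm{IF}(\delta_z; S, P) := \lim_{\varepsilon \downarrow 0} \frac{S((1 - \varepsilon) P + \varepsilon \delta_z) - S(P)}{\varepsilon}$, where $\delta_z$ is the Dirac distribution at the point $z$.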
The influence function can be interpreted as measuring the impact of an infinitesimally small amount of contamination of the original distribution $P$, in the direction of a Dirac distribution at the point $z$, on the quantity of interest $S(P)$. If the influence function exists and if it is continuous and linear, then it is a Gâteaux derivative of the operator $S : \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}}) \to H$, $P \mapsto f_{L^*, P, \lambda}$ in the direction of the mixture distribution $(1 - \varepsilon) P + \varepsilon \delta_z$. From this point of view, we are interested in conditions under which our statistical method has a bounded influence function: the lower the bound, the more robust the method. Note that in this context IF itself is a function mapping a Dirac measure $\delta$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}})$ to a predictor in an RKHS, i. e. $\mathrm{IF}(\delta; S, P)(\cdot) \in H$. Therefore we can evaluate $\mathrm{IF}(\delta; S, P)(\cdot)$ at a point $x \in \mathcal{X}$ to receive a real value ($\mathrm{IF}(\delta; S, P)(x) \in \mathbb{R}$ for all $x \in \mathcal{X}$) due to Proposition 2.2 if we use a continuous and bounded kernel.

Proposition 3.2 As shown in Christmann, Van Messem & Steinwart (2009) the influence function (in the unregionalized situation) exists and is bounded if $\mathcal{X}$ is a complete, separable metric space, $H$ is an RKHS of a bounded and continuous kernel $k$, $L$ is a convex and Lipschitz continuous loss function with continuous partial (Fréchet-)derivatives (with respect to the last argument) $L'(y, \cdot)$ and $L''(y, \cdot)$ with $\sup_{y \in \mathcal{Y}} \|L'(y, \cdot)\|_\infty \in\, ]0, \infty[$ and $\sup_{y \in \mathcal{Y}} \|L''(y, \cdot)\|_\infty < \infty$. The upper bound of the influence function in $H$-norm is given by $2 \lambda^{-1} \|k\|_\infty |L|_1$. In sup-norm the upper bound is then $2 \lambda^{-1} \|k\|_\infty^2 |L|_1$ according to Proposition 2.1.
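The sup-norm bound $2 \lambda^{-1} \|k\|_\infty^2 |L|_1$ of Proposition 3.2 can be compared with a numerical difference quotient. The sketch below is an illustration under several assumptions that are not taken from the paper: the empirical measure $D_n$ of a simulated sample plays the role of $P$, the kernel is Gaussian with parametrization $\exp(-\gamma (x - x')^2)$ (so $\|k\|_\infty = 1$), the loss is the logistic loss for classification (so $|L|_1 = 1$), the regularized risk is taken as risk plus $\lambda \|f\|_H^2$, and the minimizer is computed by a plain Newton iteration on the representer coefficients. The function `fit_svm` and all data are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian kernel exp(-gamma * (x - x')^2) on the real line."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_svm(K, y, p, lam, max_iter=100, tol=1e-12):
    """Minimise sum_i p_i ln(1 + exp(-y_i f(x_i))) + lam ||f||_H^2 over
    f = sum_j alpha_j k(., x_j) with Newton's method; returns alpha."""
    alpha = np.zeros(len(y))
    for _ in range(max_iter):
        s = 1.0 / (1.0 + np.exp(y * (K @ alpha)))   # sigma(-y_i f(x_i))
        residual = -p * y * s + 2.0 * lam * alpha   # vanishes at the minimiser
        if np.max(np.abs(residual)) < tol:
            break
        w = p * s * (1.0 - s)                       # weighted second derivatives
        jacobian = w[:, None] * K + 2.0 * lam * np.eye(len(y))
        alpha -= np.linalg.solve(jacobian, residual)
    return alpha

rng = np.random.default_rng(0)
n, lam, eps = 40, 0.1, 1e-3
x = rng.uniform(-3.0, 3.0, n)
y = np.where(x + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
z_x, z_y = 0.5, -1.0                                # contamination point z

pts, lab = np.append(x, z_x), np.append(y, z_y)
K = gaussian_kernel(pts, pts)
p_clean = np.append(np.full(n, 1.0 / n), 0.0)           # D_n
p_mixed = np.append(np.full(n, (1.0 - eps) / n), eps)   # (1 - eps) D_n + eps delta_z

grid = np.linspace(-4.0, 4.0, 401)
K_grid = gaussian_kernel(grid, pts)
f_clean = K_grid @ fit_svm(K, lab, p_clean, lam)
f_mixed = K_grid @ fit_svm(K, lab, p_mixed, lam)

if_on_grid = (f_mixed - f_clean) / eps   # difference quotient approximating IF
if_sup = np.max(np.abs(if_on_grid))
bound = 2.0 / lam                        # 2 lam^-1 ||k||_inf^2 |L|_1 with both equal to 1
```

In this setting `if_sup` should stay below `bound`; the difference quotient is only a finite-$\varepsilon$ approximation of the limit in Hampel's definition, so it illustrates the bound rather than proving anything.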
As is already the case in the proof of universal consistency in Dumpert & Christmann (2018), all assumptions can be verified without knowing anything about the underlying distribution $P$. As an example one might mention a standard scenario: $x, x' \in \mathcal{X}$, for a $\gamma > 0$ and the logistic loss function for regression $L(y, t) := -\ln \frac{4 \exp(y - t)}{(1 + \exp(y - t))^{2}}$ or for classification $L(y, t) := \ln(1 + \exp(-yt))$, respectively. Note that these loss functions fulfil the required properties but, unfortunately, lead only to convex optimization problems (instead of quadratic problems with box constraints, which result from using non-smooth loss functions like the hinge loss for classification or the ε-insensitive loss for regression). Nevertheless there are extensions of the proofs on robustness properties also for these non-smooth loss functions, see Christmann & van Messem (2008), Steinwart (2009), and Christmann (2010), but we do not prove these extensions for the localized situation within this paper. In the global, i. e. not regionalized, situation, we can rewrite the influence function as follows:
This is used to define an influence function of the composed predictor defined in (3). Recall that this composed predictor is, in general, not an element of a Hilbert space; however, it is an element of $L_\infty(P_X)$ on $\mathcal{X}$ and thereby an element of a Banach space if we use bounded kernels. Thus, Hampel's definition is suitable in the regionalized situation, too. Define $\mathrm{IF}_{\mathrm{comp}}(\delta_z; S, P)$, i. e. the influence function of the composed predictor, straightforwardly as follows.

Definition 3.3
The influence function of the composed predictor as defined in (3) is (if it exists)
Note that in this regionalized context $\mathrm{IF}_{\mathrm{comp}}$ itself is a function mapping a Dirac measure $\delta$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}})$ to a predictor in the Banach space, i. e. $\mathrm{IF}_{\mathrm{comp}}(\delta; S, P)(\cdot) \in L_\infty(P_X)$ (if we use a bounded kernel, see Proposition 2.1). It is possible to show that a composed predictor as defined in (3) also has a bounded influence function. To do this, we use the following notation:
By this, $\tilde{P}_{b,\varepsilon,z}$ stands for the mixture distribution on $(\mathcal{X}_b \times \mathcal{Y}, \mathcal{B}_{\mathcal{X}_b \times \mathcal{Y}})$ if the SVM on $\mathcal{X}_b$ is affected by $\delta_z$. In all other situations, $\tilde{P}_{b,\varepsilon,z} = P_b$. This notation is necessary to guarantee that a local SVM is always learned with respect to a probability measure. Note that the local influence function $\mathrm{IF}_b$ is 0 in all situations where $\tilde{P}_{b,\varepsilon,z} = P_b$, $b \in \{1, \ldots, B\}$.
Note that fulfilling assumptions (R1) to (R4) is sufficient to produce such regions $\mathcal{X}_b$ out of an input space $\mathcal{X}$. Also note that continuous kernels are of course measurable, and that their corresponding RKHSs are separable, see Steinwart & Christmann (2008, Lemma 4.33). Note that $\sup_{y \in \mathcal{Y}} \|L'(y, \cdot)\|_{\mathcal{X}_b\text{-}\infty} \in\, ]0, \infty[$ already implies the Lipschitz continuity of $L$ with $|L|_1 \neq 0$. This is useful for a fair comparison of the assumptions of the different theorems on consistency (Theorem 2.3) and robustness (Theorems 3.4 and 3.7).
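The remark that a bounded, non-vanishing derivative yields a nonzero Lipschitz constant can be illustrated with the two logistic losses mentioned earlier. The following sketch approximates the largest slope in $t$ by difference quotients on a grid; the grid, the chosen values of $y$ and the helper `max_abs_slope` are assumptions made only for this illustration.

```python
import numpy as np

def logistic_regression_loss(y, t):
    """L(y, t) = -ln(4 exp(r) / (1 + exp(r))^2) with r = y - t, written stably."""
    r = y - t
    return -np.log(4.0) - r + 2.0 * np.logaddexp(0.0, r)

def logistic_classification_loss(y, t):
    """L(y, t) = ln(1 + exp(-y t)) for labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * t)

def max_abs_slope(loss, y, t_grid):
    """Largest absolute difference quotient in t on the grid, for fixed y."""
    values = loss(y, t_grid)
    return np.max(np.abs(np.diff(values) / np.diff(t_grid)))

t_grid = np.linspace(-30.0, 30.0, 6001)

# Both slopes approach, but do not exceed, 1, consistent with |L|_1 = 1.
slope_regression = max(
    max_abs_slope(logistic_regression_loss, y, t_grid) for y in (-5.0, 0.0, 5.0)
)
slope_classification = max(
    max_abs_slope(logistic_classification_loss, y, t_grid) for y in (-1.0, 1.0)
)
```

Analytically, the derivative with respect to $t$ is $-\tanh((y - t)/2)$ for the regression loss and $-y / (1 + \exp(yt))$ for the classification loss, so both have absolute value below 1 with supremum 1.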

Proof. [Theorem 3.4]
To show the result, we decompose the predictor.
According to Proposition 3.2 the influence function of every local SVM exists and is bounded. Thus, the above sum exists and is bounded, too.
The upper bounds of the local influence functions, see Christmann, Van Messem & Steinwart (2009), can be used to give an upper bound of the influence function of the global predictor. Every $\ldots, B\}$, $H_b$ is the RKHS of $k_b$, and where $\|\cdot\|_{(\mathcal{X}_b \times \mathcal{Y})\text{-TV}}$ is the total variation norm on the space of distributions on $(\mathcal{X}_b \times \mathcal{Y}, \mathcal{B}_{\mathcal{X}_b \times \mathcal{Y}})$; for details see, e. g., Denkowski, Migórski & Papageorgiou (2003, p. 158). Note that according to
Proof. [Theorem 3.5] By Theorem 3.4 we can straightforwardly prove an upper bound for the influence function of the global predictor using the triangle inequality and using that the weights
The last inequality is true due to a general (and very rough) upper bound on the total variation norm. The inequality before follows from Christmann, Van Messem & Steinwart (2009, Theorem 12) using the representer theorem for support vector machines with (convex and Lipschitz continuous) shifted loss functions (Christmann, Van Messem & Steinwart, 2009, Theorem 7).
Note that there is a trade-off between two important properties of statistical methods in general and SVMs in particular: One of the assumptions for consistency of the composed global predictor in Theorem 2.3 is that $\lambda_{(n_b, b)} \to 0$ for all $b \in \{1, \ldots, B\}$. Looking at Inequality (5), we see that the smaller $\lambda_b$ is, the higher the upper bound of the influence function becomes. This means that there is a trade-off between consistency and robustness of predictors based on locally learned SVMs.
(This trade-off exists for SVMs in general, not only in the regionalization approach.) The same problem has already arisen for the upper bound of the maxbias in Dumpert & Christmann (2018) and is well-known for ill-posed problems in general and also for other notions of robustness, see e. g. Hable & Christmann (2013).
Following Christmann & Steinwart (2004) it is possible to show properties of the influence function not only for a Dirac measure $\delta_z$ but also for an arbitrary distribution $Q$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}})$. In the situation of a locally learned predictor, we can prove this, too. In analogy to $P$ and $P_b$ we define
if the support of $Q$ has a part in $\mathcal{X}_b \times \mathcal{Y}$ and the null measure otherwise. Using this, we can define
Note again that on all regions $\mathcal{X}_b \times \mathcal{Y}$ where $\tilde{P}_{b,\varepsilon,Q} = P_b$ the local influence function $\mathrm{IF}_b$ is zero.

Corollary 3.6
Under the assumptions of Theorem 3.4, $\mathrm{IF}_{\mathrm{comp}}(Q; S, P)$ exists and is bounded with upper bound $2 \, |L|_1 \sum_{b=1}^{B}$
Proof. [Corollary 3.6] The proof can be done analogously to the proofs of Theorems 3.4 and 3.5, taking into account that the local influence functions exist and are bounded.
To compare this notion of robustness to another one, the so-called maxbias, we recall the corresponding theorem from Dumpert & Christmann (2018).
Theorem 3.7 Let $\mathcal{X}$ be a separable metric space. Let $L$ be a convex, Lipschitz continuous (with Lipschitz constant $|L|_1 \neq 0$) loss function and $L^*$ its shifted version. For all $b \in \{1, \ldots, B\}$ let $k_b$ be a measurable and bounded kernel and let the corresponding RKHSs $H_b$ be separable. Let the regionalization method fulfil (R1), (R2), and (R3).
Then, for all distributions $P$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_{\mathcal{X} \times \mathcal{Y}})$ and all $\lambda$:
This bound is also a uniform bound in the sense that it is valid for all distributions $P$ and all weighting schemes fulfilling (W1) and (W2), i. e. $\sum_{b=1}^{B} w_b(x) = 1$ for all $x \in \mathcal{X}$ and $w_b(x) = 0$ for all $x \notin \mathcal{X}_b$ and for all $b \in \{1, \ldots, B\}$. In contrast to robustness in terms of the influence function, we do not have to fulfil the assumption that the regions $\mathcal{X}_b$, $b \in \{1, \ldots, B\}$, are complete or that the loss function is differentiable in order to prove the upper bound of the maxbias. On the other hand, the proof of the existence of the local influence functions (Christmann, Van Messem & Steinwart, 2009, Theorem 10) uses an implicit function theorem on Banach spaces and needs the completeness assumption for $\mathcal{X}_b$ for all $b \in \{1, \ldots, B\}$ and the continuity assumption for the kernels $k_b$ to show that the inverse appearing therein exists.
$\|x - x'\|_2^2$, $\gamma_b > 0$, for all $b \in \{1, \ldots, B\}$, and let $L$ be the logistic loss for classification or regression. Then Theorem 3.5 and Corollary 3.6 provide the uniform upper bound $2 \sum_{b=1}^{B} \lambda_b^{-1}$ for the influence function of the composed predictor.
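The uniform bound $2 \sum_{b=1}^{B} \lambda_b^{-1}$ from this example also makes the trade-off between consistency and robustness tangible. The sketch below evaluates the bound for a hypothetical regularization schedule $\lambda_b = n_b^{-1/2}$; this schedule and the region sizes are assumptions chosen only to show how the bound grows when the $\lambda_b$ shrink, and `influence_bound` is an illustrative helper.

```python
def influence_bound(lambdas):
    """Uniform bound 2 * sum_b 1 / lambda_b from the example above
    (stated there for the logistic loss and the kernels of that example)."""
    if any(lam <= 0 for lam in lambdas):
        raise ValueError("regularization parameters must be positive")
    return 2.0 * sum(1.0 / lam for lam in lambdas)

# Hypothetical region sizes (n_1, n_2) for a small and a large sample.
region_sizes = [(100, 400), (10_000, 40_000)]

# Assumed schedule lambda_b = n_b ** -0.5; smaller lambda_b give a larger bound.
bounds = [
    influence_bound([n_b ** -0.5 for n_b in sizes]) for sizes in region_sizes
]

# Global case B = 1: the bound reduces to 2 / lambda.
global_bound = influence_bound([0.5])
```

With a single region the expression reduces to $2 \lambda^{-1}$, which matches the sup-norm bound of Proposition 3.2 when $\|k\|_\infty = 1$ and $|L|_1 = 1$.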
We can compare not only the two mentioned notions of quantitative robustness (maxbias and influence function) but also robustness in the regionalized vs. the unregionalized (i. e. the global) case. In the latter, there is only one region ($B = B_n = 1$). Using this in (5) or (6), respectively, and comparing the result to Proposition 3.2 or Christmann, Van Messem & Steinwart (2009, Theorem 12), respectively, we see that we do not lose robustness by using localized SVMs instead of one global one. (Note that $\|w_b\|_{\mathcal{X}_b\text{-}\infty}$ usually equals 1, $b \in \{1, \ldots, B\}$. Otherwise there would be a region with no points $(x, y) \in \mathcal{X} \times \mathcal{Y}$ of its own, i. e. a region that shares all of its points with at least one other region. This seems unrealistic as an outcome of a regionalization method.)

Summary
By proving and discussing quantitative robustness properties of locally learned predictors we have refined our analysis of local learning. We showed that quantitative robustness properties of kernel-based methods like support vector machines are conserved in the local approach. From this point of view, there is no disadvantage in learning separate predictors, one for each region, and combining them. All of the results have been shown for all distributions and only under assumptions which are verifiable by the user.