SUPPORT VECTOR MACHINE CLASSIFIERS BY NON-EUCLIDEAN MARGINS

Abstract. In this article, the classical support vector machine (SVM) classifiers are generalized by non-Euclidean margins. We first extend the linear models of the SVM classifiers by non-Euclidean margins, including the theorems and algorithms of the SVM classifiers by hard margins and soft margins. In particular, the SVM classifiers by ∞-norm margins can be solved by 1-norm optimization with sparsity. Next, we show that the non-linear models of the SVM classifiers by q-norm margins can be equivalently transformed into the SVM in the p-norm reproducing kernel Banach spaces given by the hinge loss, where 1/p + 1/q = 1. Finally, we present numerical examples on artificial and real data to compare the different algorithms of the SVM classifiers by the ∞-norm margin.

Figure 1. Examples of Euclidean and non-Euclidean margins. The black line is the decision boundary; the distance of the gap between the two red dashed lines is the margin.
2. Linear support vector machine classifiers by non-Euclidean margins.
2.1. Maximal margin classification. We first review the basic ideas of binary classification by maximal margins in [17]. In binary classification, we have a training data set D := {(x_i, y_i)}_{i=1}^N composed of input data points x_1, x_2, . . . , x_N ∈ R^d and output data labels y_1, y_2, . . . , y_N ∈ {±1}. We construct a hyperplane H := {x ∈ R^d : xω + b = 0} to classify the data such that x_iω + b > 0 if y_i = 1, and x_iω + b < 0 if y_i = −1, for all i = 1, 2, . . . , N.
At first, we assume that D is linearly separable, that is, a separating hyperplane H exists. By the properties of H, each class of data points lies on one side of H. To avoid over-fitting, the geometrical requirement on H is that the distances from the data points to H should be as large as possible. For fairness to both classes, the decision hyperplane H is chosen in the middle, so that the minimal distances from H to the two classes are equal. In the classical maximal margin classification, the Euclidean margin is the Euclidean distance of the gap between the two classes based on the decision hyperplane, for example, the 2-norm margin in Figure 1a. Motivated by the different geometrical structures of practical applications, we extend the Euclidean margin to the non-Euclidean margin, that is, we generalize from Euclidean distance to non-Euclidean distance, for example, the ∞-norm margin in Figure 1b. Similarly, the SVM classifiers by non-Euclidean margins are constructed to maximize the non-Euclidean margins.
The linear model constructs a decision hyperplane that separates the data by maximizing the non-Euclidean margin.
Let ||·|| be a norm defined on R^d and ||·||_* its dual norm on R^d, such that their dual bilinear product is given by the vector product. By Theorem A.1, we have

dist(x_i, H) = |x_iω + b| / ||ω||_*, i = 1, 2, . . . , N,

where dist(x, H) denotes the distance from a point x to a hyperplane H. The linear separability of the data set guarantees that y_i(x_iω + b) > 0, and so y_i = sign(x_iω + b), for all i = 1, 2, . . . , N. If ||ω||_* = 1, then we also have

dist(x_i, H) = y_i(x_iω + b), i = 1, 2, . . . , N.
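The distance formula above is easy to check numerically. The following sketch (Python with NumPy; the function name and the toy numbers are ours) evaluates dist(x, H) = |xω + b| / ||ω||_*, where the `ord` argument is the order of the dual norm:

```python
import numpy as np

def dist_to_hyperplane(x, w, b, dual_ord):
    """Distance from x to H = {z : z @ w + b = 0}, measured in the norm
    whose dual norm has order `dual_ord` (as in Theorem A.1):
        dist(x, H) = |x @ w + b| / ||w||_*."""
    return abs(x @ w + b) / np.linalg.norm(w, ord=dual_ord)

x = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
b = -1.0
# Euclidean distance (the 2-norm is self-dual): |3 + 8 - 1| / 5 = 2.0
print(dist_to_hyperplane(x, w, b, 2))
# infinity-norm distance uses the 1-norm of w: 10 / 7
print(dist_to_hyperplane(x, w, b, 1))
```

Note the pairing: measuring the distance in the p-norm requires the q-norm of ω with 1/p + 1/q = 1, which is why the ∞-norm margin leads to a 1-norm objective below.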
The idea of maximal margin classification is to maximize the minimum of the distances from x_i to H over all i = 1, 2, . . . , N. Since dist(x_i, H) = y_i(x_iω + b) when ||ω||_* = 1, this yields the optimization

max_{ω ∈ R^d, b ∈ R} M subject to y_i(x_iω + b) ≥ M, i = 1, 2, . . . , N, ||ω||_* = 1. (1)

To solve Optimization (1), we replace the norm constraint ||ω||_* = 1 by rescaling so that ||ω||_* = 1/M. Thus, Optimization (1) can be transformed into the constrained optimization

min_{ω ∈ R^d, b ∈ R} ||ω||_* subject to y_i(x_iω + b) ≥ 1, i = 1, 2, . . . , N. (2)

The objective function of Optimization (2) is a convex function and the constraints are linear inequalities. This assures that Optimization (2) is a convex problem. Next, we show the existence and conditional uniqueness of the solutions to Optimization (2).

Proposition 1. If D is linearly separable, then the solution set to Optimization (2) is non-empty, compact, and convex. In particular, if the norm ||·||_* is further strictly convex, Optimization (2) has one and only one solution.
Proof. Since D is linearly separable, there exists û := (ω̂, b̂) ∈ R^{d+1} such that y_i(x_iω̂ + b̂) ≥ 1 for all i. Thus the feasible solution set of Optimization (2), which we denote by Ξ, is non-empty. Since the constraints are linear inequalities, Ξ is convex. We can equivalently transform (2) into

min_{u = (ω, b) ∈ R^{d+1}} ||ω||_* + δ(u|Ξ), (3)

where δ(u|Ξ) = 0 if u ∈ Ξ and +∞ otherwise. It is worth pointing out that after introducing δ(·|·), the range is the extended real number line R̄ := R ∪ {−∞, +∞} rather than R.
To complete the proof, we only need to discuss the solution set of (3). By Theorem 27.2 in [12], it suffices to show that the objective function F(u) := ||ω||_* + δ(u|Ξ) is a convex, closed, proper function with no direction of recession; then the solution set to (2) is non-empty, closed, bounded (hence compact) and convex.
Choose u_1 := (ω_1, b_1), u_2 := (ω_2, b_2) ∈ R^{d+1} and λ ∈ [0, 1]; there are two cases to show the convexity. One is that at least one of u_1 and u_2 is not in Ξ; then λF(u_1) + (1 − λ)F(u_2) = ∞ (by the arithmetic conventions in R̄). In this case, F(λu_1 + (1 − λ)u_2) ≤ ∞ = λF(u_1) + (1 − λ)F(u_2). The other case is that both u_1 and u_2 are in Ξ. In this case, since Ξ is convex, we have

F(λu_1 + (1 − λ)u_2) = ||λω_1 + (1 − λ)ω_2||_* ≤ λ||ω_1||_* + (1 − λ)||ω_2||_* = λF(u_1) + (1 − λ)F(u_2).

That is, F(u) is convex.
We can now directly obtain that the solution set to optimization (2) is non-empty, compact and convex.
Finally, if ||·||_* is strictly convex, choose any distinct u_1 := (ω_1, b_1), u_2 := (ω_2, b_2) in the solution set and denote the minimum of Optimization (2) by τ. Since the solution set is convex, u_1/2 + u_2/2 is also in the solution set. By strict convexity,

||ω_1/2 + ω_2/2||_* < ||ω_1||_*/2 + ||ω_2||_*/2 = τ.

This indicates that ||ω_1/2 + ω_2/2||_* is a new minimum of Optimization (2), which is a contradiction. Therefore, u_1 = u_2. The proof of the uniqueness is completed.
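When ||·||_* is the 1-norm (the ∞-norm margin of the abstract), Optimization (2) is a linear program. A minimal sketch, assuming SciPy is available; the variable splitting w = u − v and the toy data are ours:

```python
import numpy as np
from scipy.optimize import linprog

def hard_margin_inf_svm(X, y):
    """Solve Optimization (2) with the 1-norm objective (infinity-norm
    margin):  min ||w||_1  s.t.  y_i (x_i @ w + b) >= 1.
    Split w = u - v with u, v >= 0 so the objective becomes linear."""
    N, d = X.shape
    # variable order: [u (d), v (d), b (1)]
    c = np.concatenate([np.ones(2 * d), [0.0]])
    # y_i (x_i @ (u - v) + b) >= 1   <=>   -y_i [x_i, -x_i, 1] @ vars <= -1
    A = -y[:, None] * np.hstack([X, -X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    bounds = [(0, None)] * (2 * d) + [(None, None)]
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    return w, res.x[-1]

# Two separable points; the LP recovers a sparse separator.
X = np.array([[0.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, 1.0])
w, b = hard_margin_inf_svm(X, y)
print(w, b)
```

The LP optimum sits at a vertex of the feasible set, which is the geometric source of the sparsity discussed in Section 2.4.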

2.2. The linear support vector machine classifiers. In real-world applications, because of noise the classes overlap, so the data set may not be linearly separable. If D is not linearly separable, Optimization (2) has no solution. As with the soft margins of the classical SVM classifiers, one way to deal with the overlap is still to maximize the non-Euclidean margin, but to allow some points to be on the wrong side of the decision boundary.
Define the slack variables ξ := (ξ_i)_{i=1}^N. Figure 2 shows the difference between the hard margin and the soft margin.
There are two natural ways to modify the constraints in Optimization (1): y_i(x_iω + b) ≥ M − ξ_i or y_i(x_iω + b) ≥ M(1 − ξ_i), with ξ_i ≥ 0, for i = 1, 2, . . . , N. The first choice seems more natural, but it results in a non-convex optimization problem, while the second is convex. This is because with the first choice the constraints in Optimization (2) finally become y_i(x_iω + b) ≥ 1 − ξ_i||ω||_*, which leads to non-convexity. Thus, we reformulate the constraints in Optimization (2) as y_i(x_iω + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, 2, . . . , N. Here ξ_i = 0 in case of no error; if a pattern falls within the margin, its ξ_i is positive. Hence, by minimizing Σ_{i=1}^N ξ_i, we minimize the total proportional amount by which predictions fall on the wrong side of their margin. As in Optimization (2), we can drop the norm constraint on ω and define M = 1/||ω||_*. To maximize the margin and minimize the mistakes, we rewrite Optimization (1) in the form

min_{ω ∈ R^d, b ∈ R, ξ ∈ R^N} ||ω||_* + C Σ_{i=1}^N ξ_i subject to y_i(x_iω + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, . . . , N, (4)

where the "cost" parameter C ≥ 0 decides the trade-off between maximizing the margin and minimizing the mistakes. When C is large enough, the soft-margin SVM classifiers behave as the hard-margin SVM classifiers; the linearly separable case corresponds to C → ∞, in which every ξ_i is constrained to be 0. Now the target variable has changed from (ω, b) to (ω, b, ξ).
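With the 1-norm objective, Optimization (4) is again a linear program. A minimal sketch with SciPy (the variable ordering and toy data are ours); with overlapping data the slacks ξ_i absorb the violations, while for separable data and large C they vanish:

```python
import numpy as np
from scipy.optimize import linprog

def soft_margin_inf_svm(X, y, C=1.0):
    """Solve Optimization (4) with the 1-norm objective (infinity-norm
    margin):  min ||w||_1 + C * sum(xi)
              s.t.  y_i (x_i @ w + b) >= 1 - xi_i,  xi_i >= 0.
    Split w = u - v with u, v >= 0 to obtain an LP."""
    N, d = X.shape
    # variable order: [u (d), v (d), b (1), xi (N)]
    c = np.concatenate([np.ones(2 * d), [0.0], C * np.ones(N)])
    # -y_i (x_i @ (u - v) + b) - xi_i <= -1
    A = np.hstack([-y[:, None] * X, y[:, None] * X,
                   -y[:, None], -np.eye(N)])
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * N
    res = linprog(c, A_ub=A, b_ub=-np.ones(N), bounds=bounds, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    return w, res.x[2 * d], res.x[2 * d + 1:]

X = np.array([[0.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, 1.0])
w, b, xi = soft_margin_inf_svm(X, y, C=10.0)
print(w, b, xi)  # separable data, large C: the slacks stay at zero
```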

Remark 1. There is another way to handle the noise: a constant ϑ ≥ 0 is used to control the misclassifications through the constraint Σ_{i=1}^N ξ_i ≤ ϑ. As in Optimization (2), we can drop the norm constraint on ω and define M = 1/||ω||_*. Then we have the constrained optimization

min_{ω, b, ξ} ||ω||_* subject to y_i(x_iω + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, . . . , N, Σ_{i=1}^N ξ_i ≤ ϑ. (5)

By Theorem 28.1 in [12], using Lagrange multipliers, if C is the Kuhn-Tucker coefficient corresponding to the constraint Σ_{i=1}^N ξ_i − ϑ ≤ 0, we can obtain the solution set of Optimization (5) by solving Optimization (4). Since Optimization (4) is computationally easier and has received more attention than Optimization (5), in this paper we only discuss Optimization (4).
The objective function of Optimization (4) is a norm plus a linear sum, and the constraints are still linear inequalities. Thus, Optimization (4) is convex. We will show that the solution set to Optimization (4) is also non-empty, compact and convex.

Proposition 2. The solution set to Optimization (4) is non-empty, compact and convex. In particular, if ||·||_* is further strictly convex, Optimization (4) has one and only one solution.
Proof. Without loss of generality, we can assume that C > 0; the case C = 0 leads to the trivial situation in which the minimum is 0 and ω = 0.
By the construction of Optimization (4), we can select an arbitrary (ω̂, b̂) ∈ R^{d+1} and let ξ̂_i = max{1 − y_i(x_iω̂ + b̂), 0}; then v̂ := (ω̂, b̂, ξ̂) is a feasible solution to Optimization (4). Thus the feasible solution set of Optimization (4), which we denote by Ξ, is non-empty. Since the constraints are all linear inequalities, Ξ is convex. We can equivalently transform (4) into

min_{v = (ω, b, ξ) ∈ R^{d+1+N}} ||ω||_* + C Σ_{i=1}^N ξ_i + δ(v|Ξ). (6)

Similarly, we are now working in R̄.
To complete the proof, we only need to discuss the solution set of Optimization (6). By Theorem 27.2 in [12], it suffices to show that the objective function F(v) := ||ω||_* + C Σ_{i=1}^N ξ_i + δ(v|Ξ) is a convex, closed, proper function with no direction of recession; then the solution set to (4) is non-empty, closed, bounded (hence compact) and convex. Choose v_1 := (ω_1, b_1, ξ_1), v_2 := (ω_2, b_2, ξ_2) ∈ R^{d+1+N}, where ξ_{1,i} denotes the ith element of ξ_1 and ξ_{2,i} the ith element of ξ_2, and λ ∈ [0, 1]; there are two cases to show the convexity. One is that at least one of v_1 and v_2 is not in Ξ; then λF(v_1) + (1 − λ)F(v_2) = ∞ (by the arithmetic conventions in R̄), so F(λv_1 + (1 − λ)v_2) ≤ ∞ = λF(v_1) + (1 − λ)F(v_2). The other case is that both v_1 and v_2 are in Ξ; in this case, since Ξ is convex, the convexity of the norm and of the linear sum gives F(λv_1 + (1 − λ)v_2) ≤ λF(v_1) + (1 − λ)F(v_2). Next we show that F(v) has no direction of recession. It is equivalent to show that for any v̄ := (ω̄, b̄, ξ̄) ≠ 0 there exists a v_0 := (ω_0, b_0, ξ_0) such that g(λ) := F(v_0 + λv̄) is not a non-increasing function, that is, there are some λ_0 > λ_1 such that g(λ_0) > g(λ_1). Given any v̄ ≠ 0, such a v_0 and such λ_0 > λ_1 can always be chosen. Therefore, F(v) has no direction of recession.
We can now directly obtain that the solution set to optimization (4) is non-empty, compact and convex.
Finally, if ||·||_* is strictly convex, F(v) is also strictly convex. Choose any distinct v_1, v_2 in the solution set and denote the minimum of Optimization (4) by τ. Since the solution set is convex, v_1/2 + v_2/2 is also in the solution set, and F(v_1/2 + v_2/2) < F(v_1)/2 + F(v_2)/2 = τ. This indicates that F(v_1/2 + v_2/2) is a new minimum of Optimization (4), which is a contradiction. Therefore, v_1 = v_2. The proof of the uniqueness is completed.

2.3. Regularizations. Now we transform Optimization (4) into an unconstrained optimization by the Karush-Kuhn-Tucker (KKT) conditions. Suppose v* = (ω*, b*, ξ*) is one of the solutions to Optimization (4). The Lagrange (primal) function L_P(v) of Optimization (4) is

L_P(v) := ||ω||_* + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [y_i(x_iω + b) − 1 + ξ_i] − Σ_{i=1}^N μ_i ξ_i,

where α_i ≥ 0, μ_i ≥ 0 are Lagrange multipliers. Setting the derivative of L_P(v) with respect to ξ_i to zero, we have

C − α_i − μ_i = 0, i = 1, 2, . . . , N. (7)

The KKT conditions for L_P(v) include

α_i [y_i(x_iω + b) − 1 + ξ_i] = 0, μ_i ξ_i = 0, i = 1, 2, . . . , N. (8)

Combining Equations (7) and (8), we have that if ξ_i > 0 then α_i = C. Those data points whose corresponding α_i's are non-zero are the so-called support vectors, by which the classifier is totally determined. Now, although it may seem a bit strange, we can actually rewrite Optimization (4) as an unconstrained optimization problem whose target variable is (ω, b):

min_{(ω, b) ∈ R^{d+1}} Σ_{i=1}^N max{0, 1 − y_i(x_iω + b)} + λ||ω||_*, (9)

where λ = 1/C. This is indeed a regularized model: the sum of a hinge loss function and an arbitrary norm penalty term. It is a convex optimization. Before we show that Optimization (4) can be rewritten as Optimization (9), we need to show that Optimization (9) has solutions.

Proposition 3. The solution set to Optimization (9) is non-empty, compact and convex. In particular, if ||·||_* is further strictly convex, Optimization (9) has one and only one solution.
Proof. Since Optimization (9) is unconstrained, its feasible solution set is R^{d+1}, which is non-empty and convex; in particular, there exists a feasible solution û := (ω̂, b̂). By Theorem 27.2 in [12], it suffices to show that the objective function F(u) := Σ_{i=1}^N max{0, 1 − y_i(x_iω + b)} + λ||ω||_* is a convex, closed, proper function with no direction of recession; then the solution set to (9) is non-empty, closed, bounded (hence compact) and convex.
F(u) is indeed the sum of a hinge loss function and a norm penalty term, so it is convex. Since F(û) is finite and F(u) ≥ 0 > −∞, F(u) is proper. By [12, p. 52], a convex proper function is closed if it is lower semi-continuous. Because F(u) is actually continuous, the lower semi-continuity follows. So F(u) is a convex, closed, proper function.
Next we show that F(u) has no direction of recession. It is equivalent to show that for any ū := (ω̄, b̄) ≠ 0 there exists a u_0 := (ω_0, b_0) such that g(µ) := F(u_0 + µū) is not a non-increasing function, that is, there are some µ_0 > µ_1 such that g(µ_0) > g(µ_1). Given any ū ≠ 0, such a u_0 and such µ_0 > µ_1 can always be chosen. We can then directly obtain that the solution set to Optimization (9) is non-empty, compact and convex.
Finally, if ||·||_* is strictly convex, F(u) is also strictly convex. Choose any distinct u_1 := (ω_1, b_1), u_2 := (ω_2, b_2) in the solution set and denote the minimum of Optimization (9) by τ. Since the solution set is convex, u_1/2 + u_2/2 is also in the solution set, and F(u_1/2 + u_2/2) < F(u_1)/2 + F(u_2)/2 = τ. This indicates that F(u_1/2 + u_2/2) is a new minimum of Optimization (9), which is a contradiction. Therefore, u_1 = u_2. The proof of the uniqueness is completed.
Next we will discuss the relationship between the solutions to Optimization (4) and the solutions to Optimization (9).
Theorem 2.1. Let λ = 1/C. (1) If v* = (ω*, b*, ξ*) is one of the solutions to Optimization (4), then u* = (ω*, b*) is also one of the solutions to Optimization (9). (2) If u* = (ω*, b*) is one of the solutions to Optimization (9), then let ξ*_i := max{0, 1 − y_i(x_iω* + b*)} for i = 1, 2, . . . , N; v* = (ω*, b*, ξ*) is one of the solutions to Optimization (4).

Proof.
(1) By the constraints of Optimization (4), we have ξ_i ≥ 0 and ξ_i ≥ 1 − y_i(x_iω + b), so every feasible solution v = (ω, b, ξ) to (4) satisfies ξ_i ≥ max{0, 1 − y_i(x_iω + b)}. Hence at a solution v* = (ω*, b*, ξ*) of (4) we may take ξ*_i = max{0, 1 − y_i(x_iω* + b*)}, and the objective value of (4) at v* equals ||ω*||_* + C Σ_{i=1}^N max{0, 1 − y_i(x_iω* + b*)}. Then for any feasible solution u = (ω, b) to Optimization (9),

||ω*||_* + C Σ_{i=1}^N max{0, 1 − y_i(x_iω* + b*)} ≤ ||ω||_* + C Σ_{i=1}^N max{0, 1 − y_i(x_iω + b)}.

Multiplying both sides by λ = 1/C, it follows that u* = (ω*, b*) is also one of the solutions to Optimization (9).
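For illustration, the unconstrained form (9) can also be attacked directly. Below is a minimal subgradient-descent sketch of the hinge-plus-1-norm objective (the step size, iteration count and toy data are our assumptions; the paper itself solves the equivalent constrained programs):

```python
import numpy as np

def hinge_l1_subgradient(X, y, lam=0.1, lr=0.01, epochs=2000):
    """Minimize Optimization (9) with the 1-norm penalty,
        sum_i max(0, 1 - y_i (x_i @ w + b)) + lam * ||w||_1,
    by plain subgradient descent (a sketch, not a production solver)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1            # points inside or beyond the margin
        gw = -(y[active, None] * X[active]).sum(axis=0) + lam * np.sign(w)
        gb = -y[active].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

def objective(X, y, w, b, lam):
    return np.maximum(0.0, 1 - y * (X @ w + b)).sum() + lam * np.abs(w).sum()

X = np.array([[0.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, 1.0])
w, b = hinge_l1_subgradient(X, y)
print(objective(X, y, w, b, 0.1))
```

A constant step size only reaches a neighborhood of the minimum, which is enough to see the equivalence with (4) numerically.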

2.4. Examples of p-norm margins. Next we discuss some special cases, the SVM classifiers by p-norm margins for 1 ≤ p ≤ ∞, especially for p = ∞ and p = m where m is an even, non-zero natural number. We show how to understand the sparsity brought by 1-norm regularization from the SVM classifier by ∞-norm margin, and introduce a new method to help solve the SVM classifiers by m-norm margins.
In practice, p-norm regularization is a crucial technique to prevent over-fitting [2,7,13]. Different regularization terms lead to different performances. For instance, only 1-norm regularization can result in sparsity; that is, when ||·||_* is the 1-norm, the margin is the ∞-norm margin. The reasons for sparsity have usually been discussed in the parameter space: the extreme points of the 1-norm ball are sparse, so the model can obtain sparse solutions. Here we illustrate the causes of sparsity in the data space, with several examples in 2-dimensional space. Since the decision boundaries are totally determined by a small number of support vectors, we only need to discuss the case of a few data points, especially 2 distinct points. See Figure 12 for the projections by different norms.
There are two different cases for 2 distinct points in 2-dimensional space: in case one, one of their coordinates is equal; in case two, all of their coordinates differ. We present these two cases under the 2-norm and the ∞-norm in Figures 3a and 3b, respectively. For the 2-norm, only in case one are the solutions sparse, meaning the obtained hyperplane is perpendicular or parallel to a coordinate axis. For the ∞-norm, the solutions are sparse in both cases.
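The two-point picture can be reproduced numerically. Below, for two points with all coordinates distinct (case two), the 2-norm margin solution is dense while the ∞-norm margin solution (1-norm objective, solved as an LP via SciPy) is sparse. The closed form for the two-point 2-norm solution and the toy data are ours:

```python
import numpy as np
from scipy.optimize import linprog

# Two points with all coordinates distinct (case two in the text).
p_neg, p_pos = np.array([0.0, 0.0]), np.array([3.0, 1.0])

# 2-norm margin: the optimal w is parallel to p_pos - p_neg, hence dense.
w2 = 2 * (p_pos - p_neg) / np.sum((p_pos - p_neg) ** 2)

# infinity-norm margin (1-norm objective): the LP optimum sits at a
# vertex of the 1-norm ball, hence it is sparse.  Variables [u, v, b].
d = 2
X = np.vstack([p_neg, p_pos])
y = np.array([-1.0, 1.0])
c = np.concatenate([np.ones(2 * d), [0.0]])
A = -y[:, None] * np.hstack([X, -X, np.ones((2, 1))])
res = linprog(c, A_ub=A, b_ub=-np.ones(2),
              bounds=[(0, None)] * (2 * d) + [(None, None)], method="highs")
w1 = res.x[:d] - res.x[d:2 * d]
print(w2)   # dense: both components non-zero
print(w1)   # sparse: only the coordinate with the larger gap is non-zero
```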
By Proposition 3, since the 1-norm is not strictly convex, the uniqueness of the solutions cannot be guaranteed. We now discuss the uniqueness in the data space rather than the parameter space, and give a geometrical interpretation of why the solutions may not be sparse when they are not unique. As shown in Figure 4a, in 2-dimensional space, when the slope of the line connecting two distinct points is ±1, the tangent point of the two ∞-norm balls is the common vertex of the two squares. Then any line lying entirely inside the shaded area in Figure 4a is a solution of the optimization. These cases correspond to parameter points lying on the blue line segment (the top-right edge of the rhombus) in Figure 4b. In this case, almost all of these solutions are not sparse.
However, this does not mean that the SVM classifier by ∞-norm margin never has a unique solution; it can obtain a unique solution in some special cases. In Figure 5a, the solution obtained is the only possible one, and it is sparse. In parameter space, the corresponding parameter point is the blue point in Figure 5b, which lies exactly on one of the coordinate axes.

Next, we discuss the special cases p = m where m is an even natural number. In the standard SVM classifier by Euclidean margin, there is a fast algorithm called Sequential Minimal Optimization (SMO) [10] to solve Optimization (4). SMO actually solves the Lagrange (Wolfe) dual of Optimization (4) where ||·||_* is the 2-norm. But the Lagrange (Wolfe) dual function is derived by setting the derivative of the Lagrange (primal) function to zero, so not every norm guarantees that Optimization (4) admits a Lagrange (Wolfe) dual function.
Here α_i ≥ 0, μ_i ≥ 0, i = 1, 2, . . . , N, are Lagrange multipliers, and α = (α_i)_{i=1}^N. Since m is an even natural number, m − 1 is an odd natural number, so for any x ∈ R, |x|^m is differentiable and its derivative has a simple form. By setting the derivatives with respect to ω, b, ξ_i to zero, we obtain Equation (10). Substituting Equation (10) into L_P, we obtain the Lagrange (Wolfe) dual function L_D, where x_{i_k,j} is the jth element of x_{i_k}. We can write L_D exactly in tensor form; for convenience, the notions and operations of tensors are defined as in the book [11]. Let A_m be the corresponding mth-order data tensor, so that L_D is expressed through the m-mode product A_m α^m. Therefore, Optimization (4) where ||·||_* is the m/(m−1)-norm can be solved by solving this dual optimization.

3. Nonlinear support vector machine classifiers.

3.1. Kernel tricks and p-norm RKBSs. The SVM classifiers by non-Euclidean margins described so far find linear boundaries in the input space. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions. Generally, linear boundaries in the enlarged space achieve better training-class separation and translate to nonlinear boundaries in the original space. Once the basis functions h := (h_j(x))_{j=1}^n are selected, the procedure is the same as before; the only difference is that we now fit the SVM classifiers using the input features h(x_i) := (h_j(x_i))_{j=1}^n, i = 1, 2, . . . , N, and obtain the nonlinear function h(x)ω + b. The classifier is sign(h(x)ω + b).
If we use the basis functions h, L_D takes the corresponding kernelized form, and by (10) (here we still use the element-wise power) we can write the nonlinear function in terms of the data, as in (12). In (12), letting K_m(x, x_{i_2}, . . . , x_{i_m}) := Σ_{j=1}^n h_j(x) Π_{k=2}^m h_j(x_{i_k}), we obtain the mth-order tensor kernel function. In fact, we need not specify the transformation h(x) at all, but require only knowledge of the tensor kernel function to find the nonlinear function. This is what is called the kernel trick.
If m = 2, we have

K(x, x_i) := Σ_{j=1}^n h_j(x)h_j(x_i) = h(x)ᵀh(x_i). (13)

The functional (13) involves the basis functions h(x) only through inner products. The inner product can be treated as the standard kernel function, a special case of the tensor kernel function. In practice, the standard kernel functions are more commonly used; henceforth, we only discuss the two-variable kernel functions.

Figure 6 shows the geometrical interpretation of the kernel trick: in the original space the data set is not linearly separable (Figure 6a); after the feature mapping given by the kernel function, the data set is linearly separable in a higher-dimensional space (Figure 6b); finally, the decision boundary in the higher-dimensional space can be projected back into the original space, where it becomes a non-linear boundary.

The kernel trick is an important and powerful method to fit the nonlinear function. It depends on the data points, especially the relationships between them, and not on the choice of the basis functions. In fact, by the kernel trick, the SVM classifier fitting can be viewed as function estimation in some special function space. For the SVM classifier by Euclidean margin, this space is called a Reproducing Kernel Hilbert Space (RKHS) [13,16]. It is the inner product form of the standard kernel function that leads to a Hilbert space rather than a Banach space. To possess more geometrical structures, including sparsity, Reproducing Kernel Banach Spaces (RKBSs) are introduced [18,20], among which the p-norm RKBSs are the most popular; see Appendix B for details. In fact, RKBSs are to SVM classifiers by non-Euclidean margins what RKHSs are to SVM classifiers by Euclidean margins. That is, in this paper, we have theoretically supplemented the construction of RKBSs.
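For concreteness, the two-variable Gaussian kernel used in the experiments of Section 4 can be computed as follows (a sketch; the vectorized squared-distance expansion is standard, and σ is a data-dependent parameter):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Standard (two-variable) Gaussian kernel matrix
        K[i, j] = exp(-||x_i - z_j||_2^2 / (2 * sigma^2)),
    computed via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    # clip tiny negative values caused by floating-point cancellation
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(gaussian_kernel(X, X, sigma=1.0))
```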
The RKBSs are spaces spanned by a finite or infinite system of basis functions. In the finite-dimensional case, since any finite-dimensional normed space is a Banach space, the construction is easy. However, we cannot generalize directly from finite-dimensional Banach spaces to infinite-dimensional ones. A simple illustration: in a finite-dimensional 1-norm space, the existence of solutions to a convex optimization problem is easy to prove, but the infinite-dimensional sequence space l^1 is not reflexive, so the existence of solutions to convex optimization problems in it is difficult to show. Without reflexivity, the existence of solutions to the SVM classifier fitting problem in a Banach space cannot be guaranteed [6, p. 75, Proposition 6]. For the construction of infinite-dimensional p-norm RKBSs, we refer the reader to [18].
3.2. SVM classifiers in RKBSs. By Theorem 2.1, we will focus on the unconstrained optimization only.
Once the basis functions are selected to be the kernel functions K(·, x_i), i = 1, 2, . . . , N, we obtain the spanned space

B_p := { f = Σ_{i=1}^N a_i K(·, x_i) : a_i ∈ R },

equipped with the norm ||f||_{B_p} := ||a||_p for 1 ≤ p < ∞. B_p is a finite-dimensional normed space, so it is naturally a Banach space. In the finite-dimensional RKBSs, the non-linear SVM classifiers by non-Euclidean margins have a form similar to the linear classifiers. The corresponding optimization also has the form of Optimization (9):

min_{f ∈ B_p} Σ_{i=1}^N max{0, 1 − y_i f(x_i)} + λ||f||_{B_p}. (14)

Because f ∈ B_p is a finite linear combination of basis functions, the target variable is indeed the finite-dimensional vector a := (a_i)_{i=1}^N. Thus, Optimization (14) is actually the finite-dimensional optimization

min_{a ∈ R^N} Σ_{i=1}^N max{0, 1 − y_i Σ_{j=1}^N a_j K(x_i, x_j)} + λ||a||_p. (15)

We can still use Proposition 3 to guarantee the existence and conditional uniqueness of its solutions. The infinite-dimensional p-norm RKBSs are defined by Equations (17) and (18) in Appendix B. The optimization is the same as Optimization (14), but it is now an infinite-dimensional problem, so Proposition 3 cannot be applied. For 1 < p < ∞, B_p, which is isometrically equivalent to l^p, is reflexive, and the existence and conditional uniqueness of the solutions to Optimization (14) can be guaranteed [6, p. 75, Proposition 6]. But B_1, which is isometrically equivalent to l^1, is not reflexive. Nevertheless, the SVM classifiers by non-Euclidean margins in B_1 can still be guaranteed to have solutions. For the infinite-dimensional RKBSs, we refer to [18].
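Optimization (15) with p = 1 is again a linear program. A minimal bias-free sketch with SciPy's linprog (the function name and the reformulation details are ours; K is the kernel matrix K[i, j] = K(x_i, x_j)):

```python
import numpy as np
from scipy.optimize import linprog

def kernel_svm_1norm(K, y, lam=0.1):
    """Finite-dimensional RKBS classifier, Optimization (15) with p = 1:
        min_a  sum_i max(0, 1 - y_i (K @ a)_i) + lam * ||a||_1,
    recast as an LP with a = u - v (u, v >= 0) and hinge slacks xi."""
    N = K.shape[0]
    # variable order: [u (N), v (N), xi (N)]
    c = np.concatenate([lam * np.ones(2 * N), np.ones(N)])
    # xi_i >= 1 - y_i (K @ (u - v))_i
    A = np.hstack([-y[:, None] * K, y[:, None] * K, -np.eye(N)])
    res = linprog(c, A_ub=A, b_ub=-np.ones(N),
                  bounds=[(0, None)] * (3 * N), method="highs")
    return res.x[:N] - res.x[N:2 * N]

# Toy example with a linear kernel on two 1-D points.
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = np.array([-1.0, 1.0])
a = kernel_svm_1norm(K, y)
print(np.sign(K @ a))  # reproduces the labels
```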
Moreover, there are several representer theorems that present the form of the solution to Optimization (14) in infinite-dimensional RKBSs; see, e.g., [8,14,15,18,20].
Remark 2 (The domain of the data). When we discuss how to build the SVM classifiers by non-Euclidean margins, we use Theorem A.1. It requires that the domain of the data points be R^d, a finite-dimensional space. Such a requirement is also used in the proofs of Propositions 1, 2 and 3. In an infinite-dimensional space, compactness cannot be derived from closedness and boundedness, so the results fail to extend directly from a finite-dimensional domain to an infinite-dimensional one.
In fact, Theorem A.1 can be generalized to Banach spaces; see [3, Lemma 2.2] and [5, Lemma 1]. The only difference is the form of the functional used to define the hyperplane. Then the domains of the target variables of the optimizations can be Banach spaces, and by [6, p. 75, Proposition 6], the existence of solutions can be guaranteed if the Banach space is reflexive. However, l^1 is not reflexive, so some constraints are imposed on the choice of the norm. Note that any finite-dimensional space is a special case of a reflexive Banach space.
When we turn to the RKBSs, since we are estimating a function in the RKBS, the domain of the data becomes less important. In fact, the domain of the data can now be a locally compact Hausdorff space; see [18].

4. Numerical experiments. We now present numerical experiments on both artificial data and real data to evaluate the SVM classifiers by non-Euclidean margins. Due to the interest in sparsity, we choose as a special example the SVM classifier by ∞-norm margin in finite-dimensional space, which can be equivalently transformed into the 1-norm SVM classifier. We have introduced three convex optimizations to find the classifiers: Optimization (2) for the linearly separable case, and Optimizations (4) and (9) for the non-separable case. Thus, the experiments are performed on noiseless and noisy data sets as well as on MNIST.
The performance and sparsity of the 1-norm SVM classifier have been discussed in several recent papers: [3] gave statistical error bounds for the 1-norm SVM classifier; [19] discussed the ∞-norm margin behind the 1-norm SVM classifier, which is actually a special case of non-Euclidean margins; [21] studied the sparsity of the 1-norm SVM classifier; and [22] studied the 1-norm SVM classifier in depth.

4.1. Experiments on artificial data. The linearly separable sparse data set is generated randomly; then some noise is added to the separable data set to generate the non-separable data set. Figure 7 shows these two data sets. For the noiseless data set, because of its linear separability, the hard-margin solution exists; that is, Optimization (2) has solutions. Conversely, for the noisy data set, Optimization (2) has no solution. Thus, we perform three experiments on the noiseless data set and two experiments on the noisy data set. The results are shown in Figure 8 for the noiseless data set and in Figure 9 for the noisy data set, respectively. Recall that here ||·||_* is the 1-norm.
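The artificial data can be generated along these lines (a sketch: the paper does not specify its generator, so the separating direction, gap width, noise model and seed below are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(N=40, d=2, noise=0.0):
    """Generate a toy data set separable by a sparse hyperplane
    (here w = (1, 0, ..., 0), b = 0); optionally flip a fraction of
    labels to obtain a non-separable set."""
    X = rng.uniform(-1, 1, size=(N, d))
    X = X[np.abs(X[:, 0]) > 0.1]        # keep a gap around the boundary
    y = np.sign(X[:, 0])
    if noise > 0:
        flip = rng.random(len(y)) < noise
        y[flip] = -y[flip]
    return X, y

X, y = make_data()              # noiseless, linearly separable
Xn, yn = make_data(noise=0.1)   # noisy, not separable in general
print(len(X), len(Xn))
```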
As we can see, the classifiers in Figures 8b and 8c are exactly the same. Both allow some data points to fall in the margin and even cross the boundary. Although the soft-margin classifier misclassifies several data points of the −1 class, the classifier is intuitively close to the real solution, and it is sparse.
Similarly, the two classifiers in Figures 9a and 9b are the same and both sparse. Some data points are allowed to fall in the margin.

4.2. Experiments on MNIST. The MNIST database (http://yann.lecun.com/exdb/mnist/) of handwritten digits has 10 classes corresponding to the digits 0, 1, . . . , 9. It contains 60000 training data and 10000 testing data, each a row vector in [0, 1]^784. Since SVM classifiers by non-Euclidean margins are designed for binary classification problems, we choose the two digits 6 and 9. Thus, we have 9939 training data and 1967 testing data. Some selected images of these two classes are displayed in Figure 10; we can see that, compared to the normal images of 9, some images of 9 are quite strange. We run two experiments on MNIST. One is the linear SVM classifier by ∞-norm margin in Optimization (9). The other is the non-linear SVM classifier in Optimization (15) with the Gaussian kernel, which we call the kernel SVM classifier by ∞-norm margin. The results obtained by the linear and kernel SVM classifiers by ∞-norm margin are shown in Table 1, where σ = 9.6.

4.3. Experiments on a handwritten alphabets database. The handwritten alphabets database used in this paper comes from Kaggle (https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format); its author collected the images from several different sources. Similar to MNIST, this database of handwritten capital letters has 26 classes corresponding to the letters A, B, . . . , Z. It contains more than 372 thousand images, each a row vector in [0, 255]^784. Due to the properties of SVM classifiers by non-Euclidean margins and the restriction of computing resources, we take 5,000 data points from each of O and Q, 10,000 data points in total, and randomly divide them into a training set and a test set in a 7:3 ratio; that is, a training set with 7,000 data points and a test set with 3,000 data points. Some selected images of these two classes are displayed in Figure 11; it is easy to see that some images are quite indistinguishable. Similar to the experiments on MNIST, we run two experiments on this database. The results are shown in Table 2, where σ = 2600.
We can see that both the linear and the kernel SVM classifiers by ∞-norm margin obtain sparse results. Although the kernel SVM classifier by ∞-norm margin requires many more computing resources, it achieves a higher accuracy and a higher sparsity rate.
Appendix A. Distance from a point to a hyperplane. In this appendix, a theorem gives an explicit form for the distance from an arbitrary point to a given hyperplane, where the distance is derived from a general norm.

Theorem A.1. Considering the distance derived from a general norm ||·||, the distance between x and the hyperplane H := {z ∈ R^d : zω + b = 0} is given by

dist(x, H) = |xω + b| / ||ω||_*,

where ||·||_* is the dual norm of ||·||, defined as ||z||_* := sup{zx : ||x|| ≤ 1}.
For the original theorem and the corresponding proof, we refer the reader to [9].
It is worth pointing out that the norm used here is not limited to the p-norms for 1 ≤ p ≤ ∞ but can be any norm. To help understand the geometrical interpretation, Figure 12 shows some examples for the p-norms with p = 1, 2, ∞. Moreover, the distances are the same for all 1 ≤ p ≤ ∞ in some cases: for example, in 2-dimensional space, when the slope of the line is ±1, the distances derived from the p-norms coincide.

Figure 12. The geometrical interpretation of the distance from a point to a hyperplane, showing the 2-norm, 1-norm and ∞-norm projections of the point onto the hyperplane.
Appendix B. Reproducing Kernel Banach Spaces. We review some definitions and theorems of Reproducing Kernel Banach Spaces (RKBSs) in [18]. Suppose that Ω and Ω * are locally compact Hausdorff spaces. Let ·, · B be the dual bilinear product of the Banach space B and its dual space B * .
Definition B.1 ([18], Definition 2.1). Let K : Ω × Ω* → R be a measurable kernel, and let B be a Banach space composed of measurable functions f : Ω → R such that the dual space B* of B is isometrically equivalent to a normed space F composed of measurable functions g : Ω* → R. We call B a right-sided RKBS and K its right-sided reproducing kernel if K(x, ·) ∈ F and f(x) = ⟨f, K(x, ·)⟩_B for all f ∈ B and all x ∈ Ω.

Next, we look at the p-norm RKBSs. Let φ_n : Ω → R and φ*_n : Ω* → R be linearly independent measurable functions for n ∈ N. Denote by l^p for 1 ≤ p < ∞ and l^∞ the spaces of countable sequences of scalars with the standard norms ||·||_p and ||·||_∞, respectively, and by c_0 the subspace of l^∞ of sequences converging to 0. We define the normed spaces B_p spanned by the functions φ_n and equipped with the corresponding sequence norms of the coefficients; then [18, Theorems 3.8, 3.20, and 3.21] guarantee that B_1 is a right-sided RKBS with the right-sided reproducing kernel K(x, y) := Σ_{n=1}^∞ φ_n(x)φ*_n(y), that B_p for 1 < p < ∞ is a two-sided RKBS with the two-sided reproducing kernel K(x, y), and that B*_∞ is a two-sided RKBS with the two-sided reproducing kernel K*(y, x) := Σ_{n=1}^∞ φ*_n(y)φ_n(x). Thus, we call B_p a p-norm RKBS for 1 ≤ p < ∞, and B*_∞ an ∞-norm RKBS. These reproducing kernels K and K* are generalized Mercer kernels, and the functions φ_n and φ*_n are called the left-sided and right-sided expansions of the generalized Mercer kernel K.
It is clear that B_1 and B*_∞ are isometrically equivalent to l^1 and c_0, respectively. This shows that B_1 is isometrically equivalent to the dual space of B*_∞, while B*_∞ is a proper embedded subspace of the dual space of B_1.