General risk measures for robust machine learning

A wide array of machine learning problems are formulated as the minimization of the expectation of a convex loss function on some parameter space. Since the probability distribution of the data of interest is usually unknown, it is often estimated from training sets, which may lead to poor out-of-sample performance. In this work, we bring new insights into this problem by using the framework which has been developed in quantitative finance for risk measures. We show that the original min-max problem can be recast as a convex minimization problem under suitable assumptions. We discuss several important examples of robust formulations, in particular by defining ambiguity sets based on $\varphi$-divergences and the Wasserstein metric. We also propose an efficient algorithm for solving the corresponding convex optimization problems involving complex convex constraints. Through simulation examples, we demonstrate that this algorithm scales well on real data sets.


Introduction
In machine learning, the robustness of the solutions obtained for classification and prediction tasks remains a major issue. In Papernot, McDaniel, and Goodfellow (2016) and Kurakin, Goodfellow, and Bengio (2016), some examples are provided where small modifications of the input data can completely alter the resulting solution. In Feng, Xu, Mannor, and Yan (2014) and Plan and Vershynin (2013), poor out-of-sample performance is displayed when training data is sparse. This kind of problem also occurs in optimal control when there exist uncertainties on parameters. In Ben-Tal and Nemirovski (2000), the authors showed that a small perturbation on the parameters can turn a feasible solution into an infeasible one.
In this context, robust approaches appear as a way of controlling out-of-sample performance. There is an extensive literature dealing with robust problems and the reader is referred to Ben-Tal, El Ghaoui, and Nemirovski (2009) for a survey. One of the main approaches consists of introducing constraints on the probability distribution of the unknown data. Under some conditions, this approach is equivalent to dealing with ambiguity sets or a modified loss function. The works in Ben-Tal, Den Hertog, De Waegenaere, Melenberg, and Rennen (2013); Hu and Hong (2013); Duchi, Glynn, and Namkoong (2016); Moghaddam and Mahlooji (2016) and  have brought more insight into ambiguity sets. In  and Esfahani, Shafieezadeh-Abadeh, Hanasusanto, and Kuhn (2017), the authors present a distributionally robust optimization framework based on the Wasserstein distance. A set of probability distributions is defined as a ball centered on the reference probability with respect to the Wasserstein distance, then the optimization is carried out for the worst cost over this probability set.
This idea of minimizing the worst cost over a given probability set is well-known in quantitative finance. The robust representation of risk measures provides a theoretical framework to do so. A rich class of risk measures is the class of coherent ones which were introduced in the seminal paper by Artzner, Delbaen, Eber, and Heath (1999). In Föllmer and Schied (2016), a broader class of so-called convex risk measures was investigated, for which a large number of results were established.
In this paper, we follow the line of , which aims at reformulating robust problems using ambiguity sets as convex minimization problems. Our contribution is threefold. First, we clarify the links existing between risk measures and robust optimization. This allows us to transpose results from finance to machine learning. Second, we propose a unifying convex optimization setting for dealing with various risk measures, including those based on ϕ-divergences or the Wasserstein distance. Finally, we propose an accelerated algorithm grounded on the subgradient projection method proposed in Combettes (2003). We show that the proposed algorithm is able to efficiently solve large-scale robust problems.
The organization of the paper is as follows. In Section 2, we state the general mathematical problem we investigate in the context of machine learning. In Section 3, we first draw a parallel between this problem and convex monetary risk measures. We then provide a convex reformulation of the problem. In Section 4, we discuss some important classes of risk measures by revisiting some of the results in the literature. In Section 5, we describe our algorithm for solving convex formulations of robust problems. Then, in Section 6, we illustrate the good performance of the proposed algorithm through numerical experiments on real datasets. Finally, short concluding remarks are made in Section 7.

Problem statement
Let $(\Omega, \mathcal{F}, p)$ be the underlying probability space, where $\Omega$ is a finite set of cardinality $N$, $\mathcal{F}$ is the $\sigma$-field generated by $(\{\omega\})_{\omega\in\Omega}$, and $p$ is a probability distribution that is assumed to charge all points. Let $d$ be a nonzero integer and let $z \colon \Omega \to \mathbb{R}^d$ denote a general random variable. Note that the function $z$ can be identified with a matrix in $\mathbb{R}^{N\times d}$ where, for every $i \in [[1, N]]$, $z_i \in \mathbb{R}^d$ is the vector corresponding to the $i$-th row of the matrix $z$. We denote by $\mathcal{M}_1$ the set of probability distributions over $(\Omega, \mathcal{F})$.
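For every $i \in [[1, N]]$, let $\ell(\cdot, z_i) \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a loss function which is assumed to be lower semicontinuous (lsc) and convex, and such that
$$\bigcap_{i=1}^N \operatorname{dom} \ell(\cdot, z_i) \neq \varnothing, \tag{1}$$
where $\operatorname{dom}(g)$ denotes the domain of a function $g$, that is, the set of argument values for which this function is finite. In standard formulations of machine learning problems, one aims at finding an optimal regression vector $\hat{\theta} \in \mathbb{R}^n$ such that
$$\hat{\theta} \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n} \sum_{i=1}^N p_i\, \ell(\theta, z_i). \tag{2}$$
Indeed, setting $z_i = [x_i^\top\; y_i]^\top$ with $n = d - 1 \geq 1$, $x_i \in \mathbb{R}^n$, and $y_i \in \mathbb{R}$ allows us to recover a wide array of estimation and classification problems. For example, penalized least squares regression problems are obtained when
$$\ell(\theta, z_i) = \tfrac{1}{2}\,(y_i - x_i^\top \theta)^2 + \rho(\theta), \tag{3}$$
where $\rho \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a proper, lsc, convex penalty function. If the random variable $y$ is $\{0, 1\}$-valued, we can recover binary classification problems, for example by performing a logistic regression, i.e.
$$\ell(\theta, z_i) = \log\big(1 + \exp(x_i^\top \theta)\big) - y_i\, x_i^\top \theta. \tag{4}$$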
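As an illustration of the expected-loss objective described above, the following minimal sketch evaluates the weighted loss under a probability vector p for a least squares loss (without penalty) and for a logistic loss with {0, 1}-valued labels. The function names and the toy data are assumptions made for this example only, and the logistic loss is written in its usual negative log-likelihood form.

```python
import math

def least_squares_loss(theta, x, y):
    """Squared residual 0.5 * (y - <x, theta>)^2 for one observation (no penalty term)."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    return 0.5 * (y - pred) ** 2

def logistic_loss(theta, x, y):
    """Negative log-likelihood of a label y in {0, 1} under a logistic model."""
    s = sum(t * xi for t, xi in zip(theta, x))
    # log(1 + exp(s)) evaluated in a numerically stable way
    softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return softplus - y * s

def expected_loss(loss, theta, xs, ys, p):
    """Weighted sum of the losses, sum_i p_i * loss(theta, x_i, y_i)."""
    return sum(pi * loss(theta, x, y) for pi, x, y in zip(p, xs, ys))

# Hypothetical toy data: three observations with two features each.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1, 0, 1]
p = [1.0 / 3.0] * 3            # empirical distribution
theta0 = [0.0, 0.0]

logistic_risk_at_zero = expected_loss(logistic_loss, theta0, xs, ys, p)
ls_risk_at_zero = expected_loss(least_squares_loss, theta0, xs, ys, p)
```

At theta = 0 every logistic loss equals log 2, which gives a quick sanity check of the implementation.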
One of the main limitations of this formulation is that it assumes that the true probability distribution of the data is perfectly known. In practice, this distribution is often estimated empirically from the available observations. In this paper, we will focus on the following more general robust formulation to determine an optimal regression vector.
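Problem 1. Let $\alpha \colon \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ be a lsc convex penalty function whose domain is a nonempty subset of $\mathcal{M}_1$. We want to find
$$\hat{\theta} \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n}\; \sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i\, \ell(\theta, z_i) - \alpha(q) \Big). \tag{5}$$
In this problem, if $\alpha$ is the indicator function $\iota_{\{p\}}$ of the singleton containing a probability distribution $p$, then (2) is recovered. More generally, if $\alpha$ is equal to the indicator function of a nonempty closed convex set $Q \subset \mathcal{M}_1$, then the objective function in (5) reduces to
$$\sup_{q \in Q} \sum_{i=1}^N q_i\, \ell(\theta, z_i) = \sigma_Q\big(\ell(\theta, z)\big), \tag{6}$$
where $\ell(\theta, z) = (\ell(\theta, z_i))_{1 \leq i \leq N}$ and $\sigma_Q$ is the support function of $Q$. This corresponds to the well-known case of distributionally robust optimization using ambiguity sets.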

Convex formulation of robust inference problems using risk measures
In this section, we address Problem 1 in the light of the financial framework for monetary risk measures. We first recall known properties of risk measures and then show how Problem 1 can be reformulated as a convex problem.

Definition and properties of a risk measure
Let $\mathcal{X}$ be the space of real-valued random variables defined on the probability space $(\Omega, \mathcal{F}, p)$. We denote by $X$ a generic element of $\mathcal{X}$ and we recall that $p$ is assumed to be a distribution that charges all points. The space $\mathcal{X}$ is endowed with the pointwise order $\leq$, that is,
$$(\forall (X, Y) \in \mathcal{X}^2)\quad X \leq Y \;\Leftrightarrow\; (\forall \omega \in \Omega)\;\; X(\omega) \leq Y(\omega). \tag{7}$$
A risk measure $F$ is a real-valued function $F \colon \mathcal{X} \to \mathbb{R}$. The next four properties of risk measures were first introduced in Artzner, Delbaen, Eber, and Heath (1999) to define the so-called coherent risk measures. The interested reader can also refer to Föllmer and Schied (2016, Part I, Chapter 4).
Definition 1. A risk measure $F \colon \mathcal{X} \to \mathbb{R}$ is said to be
• monotone: if, for every $(X, Y) \in \mathcal{X}^2$, $X \leq Y \Rightarrow F[X] \leq F[Y]$,
• translation invariant: if, for every $X \in \mathcal{X}$ and $m \in \mathbb{R}$, $F[X + m] = F[X] + m$,
• convex: if, for every $(X, Y) \in \mathcal{X}^2$ and $\lambda \in [0, 1]$, $F[\lambda X + (1 - \lambda) Y] \leq \lambda F[X] + (1 - \lambda) F[Y]$,
• positively homogeneous: if, for every $X \in \mathcal{X}$ and $\lambda \in [0, +\infty[$, $F[\lambda X] = \lambda F[X]$.
A risk measure which satisfies the first two properties is called a monetary risk measure. A risk measure which satisfies the first three properties is called a convex risk measure. A risk measure which satisfies the four properties is called a coherent risk measure.
Depending on the author, the first axiom may also be expressed as: for every $(X, Y) \in \mathcal{X}^2$, $X \leq Y \Rightarrow F[X] \geq F[Y]$, if the variables $X$ and $Y$ are interpreted as gains instead of losses, which is common in finance. For this reason, some sign differences may appear between results of various authors. We have chosen to follow the paths in Rockafellar and Uryasev (2000); Ruszczynski and Shapiro (2006); Ruszczyński and Shapiro (2006) and interpret the random variable in argument as a loss. We however often refer to Föllmer and Schied (2016), providing a comprehensive view of risk measures, where the opposite convention has been adopted.
Remark 2.
(i) It readily follows from the translation invariance property that a monetary risk measure $F$ admits a primal form given by
$$(\forall X \in \mathcal{X})\quad F[X] = \inf\big\{ m \in \mathbb{R} \;\big|\; X - m \in \operatorname{lev}_{\leq 0} F \big\},$$
where $\operatorname{lev}_{\leq 0} F$ is the lower level set of $F$ at height 0 defined as
$$\operatorname{lev}_{\leq 0} F = \big\{ X \in \mathcal{X} \;\big|\; F[X] \leq 0 \big\}.$$
(ii) A monetary risk measure $F$ is 1-Lipschitz continuous with respect to the supremum norm $\|\cdot\|_\infty$. Indeed, for every $(X, Y) \in \mathcal{X}^2$, we have $X \leq Y + \|X - Y\|_\infty$. By monotonicity and translation invariance we obtain that
$$F[X] - F[Y] \leq \|X - Y\|_\infty,$$
and exchanging the roles of $X$ and $Y$ yields $|F[X] - F[Y]| \leq \|X - Y\|_\infty$.
The class of convex risk measures includes a large number of useful functions. Without entering into details, we should mention: expectation, worst case, quantile, median, and average value at risk (Föllmer and Schied, 2016).
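To make some of the risk measures listed above concrete, here is a minimal sketch on a finite sample space that evaluates the expectation, the worst case, and the average value at risk of a loss vector. The function names are illustrative, and the average value at risk is computed through the Rockafellar-Uryasev variational formula, which is an assumption about the intended definition.

```python
def expectation(x, p):
    """Expected loss sum_i p_i * x_i."""
    return sum(pi * xi for pi, xi in zip(p, x))

def worst_case(x, p):
    """Largest loss among the points charged by p."""
    return max(xi for pi, xi in zip(p, x) if pi > 0)

def average_value_at_risk(x, p, beta):
    """Average value at risk at level beta in [0, 1).

    Uses the formula min_m { m + E[(X - m)_+] / (1 - beta) }.
    For a finite distribution the objective is piecewise linear in m,
    so it is enough to scan the atoms of X.
    """
    return min(
        m + expectation([max(xi - m, 0.0) for xi in x], p) / (1.0 - beta)
        for m in x
    )

losses = [1.0, 2.0, 3.0, 4.0]
p = [0.25, 0.25, 0.25, 0.25]

avar = average_value_at_risk(losses, p, beta=0.5)
# Translation invariance: adding a constant to the losses shifts the risk by that constant.
shifted = average_value_at_risk([xi + 10.0 for xi in losses], p, beta=0.5)
```

With a uniform distribution and beta = 0.5, the average value at risk is the mean of the two largest losses, and it lies between the expectation and the worst case.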

Convex reformulation
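In this section, we will show that the "min-max" Problem 1 admits a convex reformulation. We first gather in the following proposition some existing results in the literature.
Proposition 3. $F$ is a convex risk measure if and only if there exists a lsc and convex function $\alpha \colon \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$, whose domain is a nonempty subset of $\mathcal{M}_1$, such that
$$(\forall X \in \mathcal{X})\quad F[X] = \sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i X_i - \alpha(q) \Big),$$
where $X_i$ denotes the value taken by $X$ at the $i$-th point of $\Omega$. The function $\alpha$ associated with $F$ is uniquely defined as
$$(\forall q \in \mathcal{M}_1)\quad \alpha(q) = \sup_{X \in \mathcal{X}} \Big( \sum_{i=1}^N q_i X_i - F[X] \Big).$$
In addition, $F$ is coherent if and only if its conjugate function $\alpha$ is the indicator function of a nonempty closed convex subset of $\mathcal{M}_1$.
Proof.
(i) We know from Föllmer and Schied (2016, Theorem 4.16 and Proposition 4.15) that any convex risk measure $F$ on $\mathcal{X}$ is of the form
$$(\forall X \in \mathcal{X})\quad F[X] = \sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i X_i - \alpha(q) \Big),$$
where $\alpha \colon \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ is the lsc and convex function whose domain is a nonempty subset of $\mathcal{M}_1$, given by
$$(\forall q \in \mathcal{M}_1)\quad \alpha(q) = \sup_{X \in \mathcal{X}} \Big( \sum_{i=1}^N q_i X_i - F[X] \Big) = \sup_{X \in \operatorname{lev}_{\leq 0} F}\; \sum_{i=1}^N q_i X_i$$
(the second equality stems from Remark 2(i)).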
Conversely, one can associate to every lsc convex function α : R N → R∪{+∞} whose domain is a nonempty subset of M 1 a unique convex risk measure defined by (12).
(ii) It follows from Föllmer and Schied (2016, Proposition 4.15) that if, in addition, the risk measure F is coherent, then the function α in (11) is the indicator function of a nonempty closed convex subset of M 1 and the converse property holds.
We now state the main result of this section.
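Theorem 4. Let $\alpha \colon \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ be a lsc convex function whose domain is a nonempty subset of $\mathcal{M}_1$. Problem 1 is equivalent to find
$$\hat{\theta} \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n} F_\alpha\big(\ell(\theta, z)\big), \tag{14}$$
where
$$(\forall x \in \mathbb{R}^N)\quad F_\alpha(x) = \sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i x_i - \alpha(q) \Big) \tag{15}$$
and $\ell(\theta, z) = (\ell(\theta, z_i))_{1 \leq i \leq N}$. The function $F_\alpha(\ell(\cdot, z))$ is proper, lsc, and convex. In addition, the so-defined convex optimization problem admits a primal formulation which consists of finding
$$\min_{\theta \in \mathbb{R}^n,\; \mu \in \mathbb{R}} \mu \quad \text{subject to} \quad \ell(\theta, z) - \mu \in S, \tag{16}$$
where
$$S = \operatorname{lev}_{\leq 0} F_\alpha. \tag{17}$$
Proof. It follows from Proposition 3 that (5) is equivalent to (14) where $F_\alpha$ is a convex risk measure. In addition, the sup in the definition of the risk measure is attained since $\mathcal{M}_1$ is a compact set and $q \mapsto \sum_{i=1}^N q_i x_i - \alpha(q)$ is upper semicontinuous. The function $\ell(\cdot, Z)$ is lsc convex for every $Z \in \mathbb{R}^d$. Given a random variable $z$, for all vectors $\theta_1$ and $\theta_2$ in $\mathbb{R}^n$, and every scalar $\lambda \in [0, 1]$, the convexity of the function $\ell(\cdot, z_i)$ yields
$$(\forall i \in [[1, N]])\quad \ell\big(\lambda \theta_1 + (1 - \lambda) \theta_2, z_i\big) \leq \lambda\, \ell(\theta_1, z_i) + (1 - \lambda)\, \ell(\theta_2, z_i).$$
Now, by using the fact that the risk measure $F_\alpha$ is monotone and convex with respect to the ordering introduced in (7), we get
$$F_\alpha\big(\ell(\lambda \theta_1 + (1 - \lambda) \theta_2, z)\big) \leq F_\alpha\big(\lambda\, \ell(\theta_1, z) + (1 - \lambda)\, \ell(\theta_2, z)\big) \leq \lambda F_\alpha\big(\ell(\theta_1, z)\big) + (1 - \lambda) F_\alpha\big(\ell(\theta_2, z)\big).$$
This shows that $F_\alpha(\ell(\cdot, z))$ is convex. In addition, since $F_\alpha$ is monotone and continuous (see Remark 2(ii)) and $\ell(\cdot, z)$ is lsc, $F_\alpha(\ell(\cdot, z))$ is lsc. Because of (1), $F_\alpha(\ell(\cdot, z))$ is also proper.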
The general convex reformulation (16) is not always easy to handle. In practical applications, the choice of the mapping α plays a crucial role in this regard. We will see in the next section some useful examples of this function. In particular, some mappings α lead to a formulation (16) that will be shown to be tractable numerically.

Examples of risk measures
By considering particular forms of function α in Problem 1, we define three scenarios of interest for robust formulations. The first two are based on ϕ-divergences, while the third one is based on the Wasserstein metric.

Perspective functions and divergences
The notion of ϕ-divergence was first introduced independently by Csiszár (1964); Morimoto (1963) and Ali and Silvey (1966). For a more complete bibliography on the subject, we refer to Basseville (2013).
The perspective function f ϕ of function ϕ is given by where the function f ϕ is the lsc envelope of the mapping f ϕ , that is We also recall the definitions of a conjugate function and an adjoint function.
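The conjugate of $\varphi$ is the function $\varphi^* \colon \xi \mapsto \sup_{\upsilon \in \mathbb{R}} \big( \xi \upsilon - \varphi(\upsilon) \big)$, and the so-called adjoint function of $\varphi$ is defined by $\tilde{\varphi}(\upsilon) = \upsilon\, \varphi(1/\upsilon)$ for every $\upsilon \in ]0, +\infty[$. Table 1 is an extension of the one in Ben-Tal, Den Hertog, De Waegenaere, Melenberg, and Rennen (2013) and provides the expressions of common ϕ functions, their conjugates, and the associated ϕ-divergence. It is well-known (Ben-Tal, Den Hertog, De Waegenaere, Melenberg, and Rennen, 2013; Combettes and Müller, 2018) that the adjoint $\tilde{\varphi}$ of $\varphi$ is such that $D_{\tilde{\varphi}}(q, p) = D_\varphi(p, q)$, and the conjugate of the function $\lambda \varphi$, with $\lambda \in ]0, +\infty[$, is $(\lambda \varphi)^* \colon \xi \mapsto \lambda\, \varphi^*(\xi / \lambda)$.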

Divergence penalty functions
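A first case of interest is when the penalty term α(q) in Problem 1 measures the "distance" between p and q in the sense of a ϕ-divergence.
Proposition 8. Let $\alpha = \lambda_0 D_\varphi(\cdot, p)$ with $\lambda_0 \in ]0, +\infty[$. Problem 1 is equivalent to find
$$(\hat{\theta}, \hat{\mu}) \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n,\; \mu \in \mathbb{R}} g(\theta, \mu), \tag{29}$$
where $g$ is the proper, lsc, convex function given by
$$g(\theta, \mu) = \mu + \lambda_0 \sum_{i=1}^N p_i\, \varphi^*\Big( \frac{\ell(\theta, z_i) - \mu}{\lambda_0} \Big). \tag{30}$$
Proof. We can reexpress (5) as
$$\min_{\theta \in \mathbb{R}^n}\; \sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i\, \ell(\theta, z_i) - \lambda_0 D_\varphi(q, p) \Big). \tag{31}$$
It follows from Föllmer and Schied (2016, Theorem 4.122) that, for every $x \in \mathbb{R}^N$,
$$\sup_{q \in \mathcal{M}_1} \Big( \sum_{i=1}^N q_i x_i - \lambda_0 D_\varphi(q, p) \Big) = \inf_{\mu \in \mathbb{R}} \Big( \mu + \lambda_0 \sum_{i=1}^N p_i\, \varphi^*\Big( \frac{x_i - \mu}{\lambda_0} \Big) \Big). \tag{32}$$
The equivalence between (31) and (29) then results from Theorem 4. In addition, plugging the expression of $\varphi^*$ in (30) yields
$$g(\theta, \mu) = \mu + \sum_{i=1}^N p_i \sup_{\upsilon_i \in [0, +\infty[} \Big( \upsilon_i \big( \ell(\theta, z_i) - \mu \big) - \lambda_0 \varphi(\upsilon_i) \Big),$$
where, for every $i \in [[1, N]]$ and $\upsilon_i \in [0, +\infty[$, $(\theta, \mu) \mapsto \upsilon_i (\ell(\theta, z_i) - \mu) - \lambda_0 \varphi(\upsilon_i)$ is a lsc convex function. Since convexity and lower semicontinuity are kept by the supremum operation, $g$ is lsc and convex. By using (1), (30), and the fact that $\varphi^*$ is proper, there exist $\theta \in \mathbb{R}^n$ and $\mu \in \mathbb{R}$ such that $g(\theta, \mu) < +\infty$.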
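As a numerical illustration of a divergence penalty, the sketch below specializes to the Kullback-Leibler divergence, assuming ϕ(t) = t log t − t + 1 so that ϕ*(s) = exp(s) − 1. In that case the penalized worst-case expectation has the well-known closed form λ0 log Σ p_i exp(x_i/λ0) (the entropic risk), attained at an exponentially tilted distribution. The helper names and the toy numbers are assumptions of this example.

```python
import math

def kl_divergence(q, p):
    """Kullback-Leibler divergence sum_i q_i * log(q_i / p_i) between probability vectors."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropic_risk(x, p, lam0):
    """lam0 * log(sum_i p_i * exp(x_i / lam0)), shifted by max(x) for stability."""
    m = max(x)
    return m + lam0 * math.log(sum(pi * math.exp((xi - m) / lam0) for pi, xi in zip(p, x)))

def tilted_distribution(x, p, lam0):
    """Maximizer q_i proportional to p_i * exp(x_i / lam0) of the penalized problem."""
    m = max(x)
    w = [pi * math.exp((xi - m) / lam0) for pi, xi in zip(p, x)]
    total = sum(w)
    return [wi / total for wi in w]

def dual_objective(mu, x, p, lam0):
    """mu + lam0 * sum_i p_i * phi_star((x_i - mu) / lam0) with phi_star(s) = exp(s) - 1."""
    return mu + lam0 * sum(pi * (math.exp((xi - mu) / lam0) - 1.0) for pi, xi in zip(p, x))

x = [0.2, 1.0, 3.0]      # hypothetical loss values
p = [0.5, 0.3, 0.2]      # reference distribution
lam0 = 0.7

q = tilted_distribution(x, p, lam0)
primal = sum(qi * xi for qi, xi in zip(q, x)) - lam0 * kl_divergence(q, p)
risk = entropic_risk(x, p, lam0)
```

The penalized value reached at the tilted distribution coincides with the entropic risk, and the dual objective in µ is minimized at µ equal to that same value.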

Constrained formulations
We now investigate two particular cases when α is the indicator function of a convex set Q of probability distributions, thereby defining an ambiguity set.

Ball with respect to a divergence
A first possibility is to introduce an upper bound on the divergence $D_\varphi(q, p)$ between the sought distribution $q$ and $p$ by considering the constraint set
$$B_\varphi = \big\{ q \in \mathcal{M}_1 \;\big|\; D_\varphi(q, p) \leq \varepsilon \big\},$$
where $\varepsilon \in ]0, +\infty[$. The following result generalizes both Ben-Tal et al. (2013), where the authors deal with linear costs under constraints, and Hu and Hong (2013), where the authors focus on the Kullback-Leibler divergence.
where g is the proper, lsc, convex function given by with the convention Proof. The risk function associated with $\alpha = \iota_{B_\varphi}$ is Since 1 belongs to the interior of dom(ϕ), Slater's condition holds for constraint (39b). Since the constraint is feasible and $q \mapsto -\sum_{i=1}^N q_i x_i + \iota_{\mathcal{M}_1}(q)$ is lsc, convex, and coercive, there exists a solution $q \in \mathcal{M}_1$ to the above constrained maximization problem. It then follows from standard Lagrange duality for convex functions that there exists $\lambda \in [0, +\infty[$ such that $(q, \lambda)$ is a saddle point of the Lagrange function where We have thus where, for every $\lambda \in [0, +\infty[$, Two cases will be distinguished.
(44) can be reexpressed as where The conjugate of Φ reads whereas the conjugate of $\iota_{\mathcal{M}_1}$ is given by $\sigma_{\mathcal{M}_1}$ in (46). Since $\sigma_{\mathcal{M}_1}$ is finite valued, the conjugate of $\Phi + \iota_{\mathcal{M}_1}$ is given by the following inf-convolution (Bauschke and Combettes, 2011, Theorem 15.3) which, by using (50), yields Since $\operatorname{dom}(\lambda \varphi) \subset [0, +\infty[$, $(\lambda \varphi)^* \colon \xi \mapsto \sup_{\upsilon \in [0, +\infty[} \big( \xi \upsilon - \lambda \varphi(\upsilon) \big)$ is an increasing function. This implies that Note that the right-hand side in the previous formula when applied at λ = 0 by using (38) and (46) reduces to Consequently, (43) leads to and (36) follows from Theorem 4. In addition, by using the expression of the conjugate, g can be reexpressed as As a supremum of lsc convex functions, g is also lsc convex. The fact that g is proper follows from arguments similar to those at the end of the proof of Proposition 8.
Remark 10. The divergence risk measure in (32) is convex, whereas the risk measure in (55) is coherent (see Proposition 3), which means that the risk scales with the data in the latter case.
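For the Kullback-Leibler case, the worst-case expectation over a divergence ball can be evaluated through a one-dimensional dual problem. The sketch below assumes the classical dual form inf over λ > 0 of λε + λ log Σ p_i exp(x_i/λ), which is specific to the Kullback-Leibler divergence, and minimizes it by ternary search (the dual function is convex in λ). The search bounds, the function names, and the toy numbers are assumptions of this example, not part of the algorithm proposed in the paper.

```python
import math

def log_mgf(x, p, lam):
    """log(sum_i p_i * exp(x_i / lam)), shifted by max(x) for numerical stability."""
    m = max(x)
    return m / lam + math.log(sum(pi * math.exp((xi - m) / lam) for pi, xi in zip(p, x)))

def dual_value(lam, x, p, eps):
    """Dual objective lam * eps + lam * log E_p[exp(X / lam)] for the KL ball of radius eps."""
    return lam * eps + lam * log_mgf(x, p, lam)

def kl_ball_worst_case(x, p, eps, lo=1e-6, hi=1e3, iters=200):
    """Minimize the convex dual objective over lam in [lo, hi] by ternary search."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if dual_value(a, x, p, eps) < dual_value(b, x, p, eps):
            hi = b
        else:
            lo = a
    lam = 0.5 * (lo + hi)
    return dual_value(lam, x, p, eps), lam

x = [0.2, 1.0, 3.0]      # hypothetical loss values
p = [0.5, 0.3, 0.2]      # reference distribution
eps = 0.1

value, lam_star = kl_ball_worst_case(x, p, eps)

# Primal check: the tilted distribution q_i ~ p_i * exp(x_i / lam_star) should lie
# on the boundary of the ball and reach the dual value.
w = [pi * math.exp(xi / lam_star) for pi, xi in zip(p, x)]
q = [wi / sum(w) for wi in w]
kl_q = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
primal_value = sum(qi * xi for qi, xi in zip(q, x))
```

The coherence of the resulting risk measure can be observed numerically: multiplying all losses by a positive constant multiplies the returned value by the same constant, up to the accuracy of the search.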

Ball with respect to the Wasserstein metric
We now investigate Problem 1 when function α is the indicator of a Wasserstein ball centered on p. For this purpose, we first recall the notion of Wasserstein distance.
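Definition 11. Let $\mathcal{M}(\Xi^2)$ denote the set of probability distributions supported on $\Xi^2$. The Wasserstein distance between two distributions $p$ and $q$ supported on $\Xi$ is defined as
$$W(p, q) = \inf_{\pi \in \mathcal{M}(\Xi^2)} \Big\{ \int_{\Xi^2} \delta(\xi_1, \xi_2)\, \pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2) \;\Big|\; \pi \text{ has marginals } p \text{ and } q \Big\},$$
where $\delta$ is a metric on $\Xi$.
We now introduce the notion of Wasserstein ball. The considered constraint set is denoted by
$$B_W = \big\{ q \in \mathcal{M}_1 \;\big|\; W(q, p) \leq \varepsilon \big\},$$
with $\varepsilon \in ]0, +\infty[$. In the following result, $\delta$ is the usual Euclidean distance. The following convex reformulation of Problem 1 can be derived from (Esfahani and Kuhn, 2015, Theorem 4.2).
Proposition 12. Let $\varepsilon \in ]0, +\infty[$ and let $\alpha = \iota_{B_W}$. Then, Problem 1 is equivalent to find $\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n} \min_{\lambda \in \mathbb{R},\, s \in \mathbb{R}^N} g(\theta, \lambda, s)$, where $g$ is the proper, lsc convex function given by
$$g(\theta, \lambda, s) = \lambda \varepsilon + \sum_{i=1}^N p_i s_i + \iota_{\mathcal{W}}(\theta, \lambda, s),$$
where $\mathcal{W}$ is the closed convex set defined as
$$\mathcal{W} = \big\{ (\theta, \lambda, s) \in \mathbb{R}^n \times [0, +\infty[ \times \mathbb{R}^N \;\big|\; (\forall (i, j) \in [[1, N]]^2)\;\; \ell(\theta, z_j) - \lambda\, \delta(z_i, z_j) \leq s_i \big\}.$$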
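For fixed loss values on a finite sample space, the worst-case expectation over a Wasserstein ball is a linear program, and linear-programming duality gives the one-dimensional form: minimize over λ ≥ 0 the quantity λε + Σ_i p_i max_j (x_j − λ d_ij). The sketch below evaluates this dual by ternary search, under the assumption that the candidate distributions are supported on the same N points and that these points are pairwise distinct. It only illustrates the inner maximization for fixed losses, not the joint minimization over θ, and all names and toy numbers are hypothetical.

```python
def wasserstein_dual_value(lam, x, p, dist, eps):
    """lam * eps + sum_i p_i * max_j (x_j - lam * dist[i][j])."""
    return lam * eps + sum(
        pi * max(xj - lam * dist[i][j] for j, xj in enumerate(x))
        for i, pi in enumerate(p)
    )

def wasserstein_worst_case(x, p, dist, eps, iters=200):
    """Worst-case expectation of x over the Wasserstein ball of radius eps around p.

    Assumes at least two support points with positive pairwise distances.
    Beyond lam = (max(x) - min(x)) / min distance, each inner max is attained
    at j = i, so the convex dual objective is nondecreasing and the search
    can be restricted to [0, hi].
    """
    n = len(x)
    dmin = min(dist[i][j] for i in range(n) for j in range(n) if i != j)
    lo, hi = 0.0, (max(x) - min(x)) / dmin
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if wasserstein_dual_value(a, x, p, dist, eps) < wasserstein_dual_value(b, x, p, dist, eps):
            hi = b
        else:
            lo = a
    lam = 0.5 * (lo + hi)
    return wasserstein_dual_value(lam, x, p, dist, eps)

# Hypothetical toy example: three points on the real line with their losses.
points = [0.0, 1.0, 3.0]
x = [0.0, 1.0, 5.0]
p = [1.0 / 3.0] * 3
dist = [[abs(a - b) for b in points] for a in points]

no_robustness = wasserstein_worst_case(x, p, dist, eps=0.0)    # radius 0: plain expectation
intermediate = wasserstein_worst_case(x, p, dist, eps=0.5)
full_robustness = wasserstein_worst_case(x, p, dist, eps=10.0)  # large radius: worst case
```

The inner maximum runs over all pairs (i, j), which is the source of the quadratic growth in the number of constraints for this type of ambiguity set.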

Numerical solution
We will now propose an algorithm allowing us to solve numerically the three convex optimization problems in Propositions 8, 9, and 12. This algorithm applies to more general choices of function α in Problem 1 where the constraint S in (17) splits as an intersection of a finite number of convex constraints.

A unifying formulation
We first show that the convex optimization problems discussed in Section 4 can be reexpressed in a unifying manner.

Description of the algorithm
In this section, we propose an accelerated projected gradient algorithm for solving Problem (62). One step of this proximal algorithm (Combettes and Pesquet, 2010) reads as a projection onto a set defined as an intersection of nontrivial closed convex sets. To solve this projection problem, we use the subgradient projection algorithm in Combettes (2003), which is related to ideas introduced in Haugazeau (1968, Theorem 3-2). This algorithm allows the constraints to be activated individually in a flexible parallel manner. We will first recall the basic structure of our algorithm before describing in more detail the subgradient projection step.
Proximal algorithm
Let $H = \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \times \mathbb{R}^N$ and let $\|\cdot\|$ (resp. $\langle \cdot, \cdot \rangle$) denote the standard norm (resp. the inner product) equipping this product space. By introducing the generic variable $u = (\theta, \lambda, \mu, s) \in H$, (62) can be reexpressed more concisely as
$$\min_{u \in C}\; \langle c, u \rangle,$$
where $c = (0, \varepsilon, 1, p) \in H$ and $C = \cap_{k=0}^K C_k$ with To solve the above problem, we propose to employ a FISTA-like algorithm (Beck and Teboulle, 2009). Let $n \in \mathbb{N} \setminus \{0\}$. The n-th iteration of this algorithm reads where $\gamma \in ]0, +\infty[$ and $P_C \colon H \to C$ is the projection onto the closed convex set $C$. It follows from (Chambolle and Dossal, 2015, Theorem 3) that, if a solution to the minimization problem exists, and then the convergence of $(u^{(n)})_{n \in \mathbb{N}}$ to a solution to the problem is guaranteed. The main difficulty in the implementation of the algorithm lies in the computation of the projection onto $C$ that will be discussed next.
Computation of the projection
Algorithm 1 presents our projection method inspired from Combettes (2003). At iteration $\ell \in \mathbb{N}$, $Q(p^{(0)}, p^{(\ell)}, r^{(\ell)})$ designates the projection of $p^{(0)}$ onto the intersection of the 3 half-spaces $C_0$, $H_\ell$, and $D_\ell$, where Since the projection onto $H_\ell \cap D_\ell$ has an explicit form (Combettes, 2003), a dual forward-backward algorithm  allows us to compute in a fast manner the projection onto $C_0 \cap H_\ell \cap D_\ell$. The algorithm has been initialized by setting $p^{(0)} = P_{C_0}(v^{(n)} - \gamma c)$, taking into account the fact that $P_C = P_C \circ P_{C_0}$. At each iteration $\ell$, $\mathbb{K}_\ell$ designates the set of indices of the constraints which are activated.
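The structure of such an accelerated projected gradient scheme for a linear cost can be sketched as follows. This is a minimal illustration under explicit assumptions: the inertial weight (n − 1)/(n + a) with a > 2 corresponds to the rule t_n = (n + a − 1)/a studied by Chambolle and Dossal, and the constraint set is replaced by a Euclidean ball, whose projection is explicit, as a stand-in for the intersection of constraints considered in the paper. The function names and default parameters are hypothetical.

```python
import math

def project_ball(u, radius=1.0):
    """Projection onto the Euclidean ball of the given radius (stand-in for P_C)."""
    norm = math.sqrt(sum(ui * ui for ui in u))
    if norm <= radius:
        return list(u)
    return [radius * ui / norm for ui in u]

def accelerated_projected_gradient(c, project, u0, gamma=0.1, a=3.0, iters=200):
    """FISTA-like iterations for min <c, u> over a closed convex set.

    u_n     = project(v_n - gamma * c)                (projected gradient step)
    v_{n+1} = u_n + (n - 1) / (n + a) * (u_n - u_{n-1})   (inertial step, a > 2)
    """
    u_prev = list(u0)
    v = list(u0)
    for n in range(1, iters + 1):
        u = project([vi - gamma * ci for vi, ci in zip(v, c)])
        beta = (n - 1.0) / (n + a)
        v = [ui + beta * (ui - upi) for ui, upi in zip(u, u_prev)]
        u_prev = u
    return u_prev

c = [3.0, 4.0]
solution = accelerated_projected_gradient(c, project_ball, u0=[0.0, 0.0])
```

On the unit ball, the minimizer of the linear cost is −c/‖c‖, which gives a simple check of the iterations. In the setting of the paper, the explicit ball projection would be replaced by the iterative projection onto the intersection of constraints.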
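The elementary building block of subgradient projection methods can be sketched as follows: for a constraint of the form f(u) ≤ 0 with f convex, the current point is projected onto the half-space obtained by linearizing f at that point, which contains the constraint set and requires only one evaluation of f and of a subgradient. This sketch shows that single step only, not the full Algorithm 1, and the function names and test constraints are assumptions of the example.

```python
def subgradient_projection(point, f, subgrad_f):
    """Project point onto {u : f(point) + <g, u - point> <= 0}, with g a subgradient of f at point.

    Returns the point unchanged if the constraint f(point) <= 0 already holds.
    For an affine f, this is the exact projection onto the half-space {f <= 0}.
    """
    value = f(point)
    if value <= 0.0:
        return list(point)
    g = subgrad_f(point)
    sq_norm = sum(gi * gi for gi in g)
    if sq_norm == 0.0:
        raise ValueError("zero subgradient at an infeasible point: the constraint set is empty")
    return [pi - value * gi / sq_norm for pi, gi in zip(point, g)]

# Affine constraint u_0 + u_1 - 1 <= 0: the step coincides with the exact projection.
affine = subgradient_projection(
    [2.0, 2.0],
    lambda u: u[0] + u[1] - 1.0,
    lambda u: [1.0, 1.0],
)

# Quadratic constraint u_0^2 + u_1^2 - 1 <= 0: the step moves toward the unit disk
# without reaching it in one iteration.
quadratic = subgradient_projection(
    [2.0, 0.0],
    lambda u: u[0] ** 2 + u[1] ** 2 - 1.0,
    lambda u: [2.0 * u[0], 2.0 * u[1]],
)
```

In the affine case the result lies on the boundary of the half-space, whereas in the quadratic case the returned point sits between the starting point and its true projection, which is why such steps are iterated.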
When dealing with large-scale problems, it may be useful not to require all the constraints to be activated at each iteration. The convergence of the algorithm is guaranteed by the study in Combettes (2000), provided that, for every k ∈ [[1, K]], C 0 ⊂ dom(∂f k ) and there exists an integer M k such that The first assumption on the domains of the subdifferentials of the functions (f k ) k∈[[1,K]] is however not satisfied in (63). In this case, the direct simpler form of the algorithm in Combettes (2003) can be applied since the parameter λ is fixed.
Application to robust binary classification

Context
In this section, we illustrate the performance of our approach on different scenarios in the context of binary classification. To this aim, we consider the ionosphere and colon-cancer datasets. The respective numbers of observations N and of features d are summarized in Table 2. Unless otherwise specified, we consider the original datasets without pre-processing, using a training set with 60% of the original database and a testing set gathering the remaining entries. The splitting between training and testing samples is performed using the function train_test_split of Scikit-learn. We propose to compare the classical formulation in Equation (2) with the formulation in Proposition 9 (resp. Proposition 12) that uses ambiguity sets defined through the Kullback-Leibler divergence (resp. Wasserstein distance). We make use of the logistic regression loss in (4) (see Briceno-Arias et al. (2017) for recent developments). The constrained minimization problems are solved by running the proposed Algorithm 1 over a sufficient number of iterations so as to reach the stability criterion $\|p^{(\ell+1)} - p^{(\ell)}\| \leq 10^{-5}$. All the tests are performed using the Julia programming language, on a computer with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 and 16 GB of RAM.

Ionosphere dataset
We display the evolution of the difference between the current cost function and its final value (computed after a very high number of iterations), with respect to the iteration number (see Figure 1) and CPU time (see Figure 2). In the case of the Kullback-Leibler divergence, we choose $\mathbb{K}_\ell = [[1, K]]$ while for the Wasserstein distance, we set the cardinality of $\mathbb{K}_\ell$ equal to 1500 (K = 44100 in this case). In this example, we observe that the convergence speed is slightly increased for large values of ε.
Regarding the comparison between the two ambiguity sets, it can be observed in Figure 2 that the method is faster in the case of the Kullback-Leibler divergence since the number of constraints grows linearly as a function of the number of observations, whereas the growth is quadratic in the case of the Wasserstein distance.
Regarding the choice of the value of ε for maximizing the AUC, the best performance is obtained for an intermediate nonzero value of this parameter. This clearly illustrates the benefit of the proposed formulation. Note that such results are consistent with the conclusions in Shafieezadeh-Abadeh,  and Gotoh, Kim, and Lim (2018). On this example, the Wasserstein ambiguity set provides better results than the Kullback-Leibler divergence, but it should be recalled that it comes at the expense of a higher computational cost.

Colon-cancer dataset
We now present in Table 3 the evolution of the AUC for tests performed on the colon-cancer dataset. This dataset only contains 64 observations. For such a small dataset, the formulation in Proposition 12 becomes very cheap in terms of computational cost. As can be noticed in Table 3, taking a nonzero value for ε leads to an increase of about 7% in terms of AUC, which is significant in such a challenging context. These results allow us to assess the robustness property of the proposed formulation.

Results on an altered database
Let us come back to the processing of the ionosphere dataset. In order to better illustrate the interest of the proposed formulation, we propose to modify the training set so that the proportion of labels (−1) and (+1) is altered and unbalanced. Such a situation could typically arise during a transient regime, such as the beginning of an epidemic, or in the case of an incomplete dataset. After dividing the dataset between a training set and a testing set, using the same 60% ratio as in our previous tests, we randomly drop a certain number of observations associated with the label (−1) so that the proportion of this label becomes ten times lower than its original proportion. Figure 4 displays ROC curves and Table 4 evaluates the AUC metric, for various values of ε. Our formulation clearly outperforms the classical logistic regression classifier (retrieved when ε = 0). Noticeably, the latter presents the same area under the curve as a random classifier, and thus exhibits a similar behavior to such a classifier.
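The alteration of the training set and the AUC evaluation described above can be sketched as follows. This is a simplified illustration rather than the authors' experimental code: the helper keeps a fixed fraction of the observations carrying the chosen label, which only approximates a tenfold reduction of the label proportion, and the AUC is computed with the pairwise ranking formula. All names, the seed, and the toy data are assumptions.

```python
import random

def drop_label(xs, ys, label=-1, keep_fraction=0.1, seed=0):
    """Randomly keep only a fraction of the observations carrying the given label."""
    rng = random.Random(seed)
    idx = [i for i, y in enumerate(ys) if y == label]
    n_keep = max(1, round(keep_fraction * len(idx)))
    kept = set(rng.sample(idx, n_keep))
    selected = [i for i, y in enumerate(ys) if y != label or i in kept]
    return [xs[i] for i in selected], [ys[i] for i in selected]

def auc(scores, labels, positive=1):
    """Area under the ROC curve: probability that a positive is ranked above a negative."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    # Each (positive, negative) pair counts 1 if correctly ordered and 0.5 for ties.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical balanced training set with 50 observations per label.
xs = list(range(100))
ys = [-1] * 50 + [1] * 50
xs_alt, ys_alt = drop_label(xs, ys)

perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 1, -1, -1])
```

A classifier producing constant scores obtains an AUC of 0.5, which is the reference value of a random classifier mentioned above.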

Variance reduction study
In a nutshell, the robust framework based on the Wasserstein distance provides a better expected reward but at the expense of a higher computational cost. The risk measure based on the Kullback-Leibler divergence is easily tractable and provides a reduction of the variance in out-of-sample results, but a smaller increase in terms of expected reward (see Gotoh, Kim, and Lim (2018) for a more detailed theoretical analysis). In practice, as discussed in Shafieezadeh-Abadeh,  and in Gotoh, Kim, and Lim (2018), "a little of robustness" typically improves the expected reward slightly (around 1%), but results in a larger reduction in terms of variance. We propose to reproduce such an analysis by means of two experiments using the ionosphere dataset. We first consider the case when a small training set is used where only 10% of the data are available. Then we focus on the case when 60% of the data are used as training set. In both cases, 1000 random realizations are run, in which we solve the classical formulation using the logistic regression (LR) loss in Equation (2), the formulation in Proposition 9 that uses ambiguity sets defined through the Kullback-Leibler divergence, and the formulation in Proposition 12 that uses ambiguity sets defined through the Wasserstein distance. We then compute the value of the AUC metric on the associated testing set and display results as histograms. When we use 10% of the data for training (Figure 5), we see the benefits of our robust solution with respect to the standard LR classifier. When more data are collected, the probability distribution becomes more accurate and our robust models tend to produce the same outputs as when using classical logistic

Conclusion
We have highlighted that risk measures offer versatile tools for addressing machine learning problems in a robust manner. By assuming that the loss function is convex, the related optimization problem has been recast as a convex one. We have shown that various classes of risk measures, e.g. those based on ϕ-divergences or on the Wasserstein distance, lead to a common convex formulation. In addition, an efficient convex optimization algorithm has been proposed to cope with the nontrivial constrained problem resulting from this formulation. We have conducted numerical experiments in which various ambiguity sets are tackled thanks to the same algorithm. We have also illustrated that the considered robust models can outperform classical ones in challenging contexts when the size of the training set is limited, or when the distribution of labels in the training set is not representative of reality.