STOCHASTIC AUC OPTIMIZATION WITH GENERAL LOSS

Abstract. Recently, there has been considerable work on developing efficient stochastic optimization algorithms for AUC maximization. However, most of it focuses on the least square loss, which may not be the best option in practice. The main difficulty in dealing with a general convex loss is the pairwise nonlinearity w.r.t. the sampling distribution generating the data. In this paper, we use Bernstein polynomials to uniformly approximate the general loss, which decouples this pairwise nonlinearity. In particular, we show that this reduction for AUC maximization with a general loss is equivalent to a weakly convex (non-convex) min-max formulation. We then develop a novel SGD algorithm for AUC maximization with per-iteration cost linear in the data dimension, making it amenable to streaming data analysis. Despite the non-convexity, we prove its global convergence by exploiting the appealing convexity-preserving property of Bernstein polynomials and the intrinsic structure of the min-max formulation. Experiments are performed to validate the effectiveness of the proposed approach.


1. Introduction. Area under the ROC curve (AUC) [2,6,11,14] is a widely used metric for measuring classification performance in imbalanced classification and bipartite ranking. In imbalanced classification, one class contains far more instances than the other. Imbalanced data sets arise in many real-world domains such as fraud detection, information retrieval and medical diagnosis. It is of fundamental importance to develop efficient optimization algorithms for analyzing streaming data, which is prevalent in the big data era.
There are considerable efforts on developing batch (offline) algorithms for AUC maximization, which use the entire data at once, including the cutting plane algorithm [16] and gradient descent methods [3,15,35]. These algorithms have convergence rates of O(min{1/ε, 1/√(λε)}) to achieve precision ε which, however, require a high per-iteration cost of O(nd). Here, λ, n, d are the regularization parameter, the number of examples, and the dimension of the data, respectively. Such algorithms are not suitable for analyzing massive streaming data due to the expensive per-iteration cost.
Stochastic optimization algorithms such as SGD [1,18,19,28,29,30,34,33] are iterative and incremental in nature and process each new sample (input) with a computationally cheap update, making them amenable to large-scale streaming data analysis. However, most existing studies focus on the classification error, where the objective function is linear w.r.t. the sampling distribution; that is, the expectation in the expected risk is taken w.r.t. a single data point. In contrast, the problem of AUC maximization involves the expectation of a pairwise loss function which depends on pairs of data points, and this makes the direct employment of standard SGD infeasible.
The studies [17,31,34,36] developed SGD or online gradient descent algorithms for AUC maximization. These appealing algorithms perform gradient descent based on the gradient of a local error which compares the current example with all previous ones. As a result, one needs to access the previous examples, which leads to expensive space and per-iteration complexity of O(td) for d-dimensional data at iteration t. Although this problem is partially mitigated by the use of buffers of a fixed size B, this reduction is not necessarily an ideal approach. The work [12] followed the same approach and noticed that, for the least square loss, such algorithms only need to update the covariance matrix of the training data, with per-iteration complexity O(d^2), which does not scale well to high-dimensional data.
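To make the buffer idea concrete, here is a minimal Python sketch (names are ours, purely illustrative): a fixed-size reservoir buffer of past examples against which each new example is compared, reducing the comparison cost from O(td) to O(Bd). This is not the exact update rule of [36], only the storage mechanism it relies on.

```python
import random

class PairwiseBuffer:
    """Fixed-size buffer holding a uniform (reservoir) sample of the stream.

    Buffer-based online AUC methods compare each new example against the
    buffered examples of the opposite class instead of the full history.
    """
    def __init__(self, size):
        self.size = size   # B: maximum number of stored examples
        self.seen = 0      # total number of examples observed so far
        self.data = []     # list of (x, y) pairs

    def opposite_class(self, y):
        """Buffered examples whose label differs from y."""
        return [(xb, yb) for (xb, yb) in self.data if yb != y]

    def add(self, x, y):
        """Reservoir sampling: the buffer stays a uniform sample of the stream."""
        self.seen += 1
        if len(self.data) < self.size:
            self.data.append((x, y))
        elif random.random() < self.size / self.seen:
            self.data[random.randrange(self.size)] = (x, y)
```

A new example (x, y) would first be compared with `buffer.opposite_class(y)` to form subgradients of the pairwise loss, and then inserted via `buffer.add(x, y)`.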
The recent work [32,22,21] used the min-max reformulation of AUC maximization with the least square loss. The main idea is to reduce the double integral w.r.t. pairs of examples in the original objective function to a single integral w.r.t. an individual example by introducing auxiliary variables. However, one shortcoming of the above studies is that such methods depend critically on the structure of the least square loss and cannot be applied to general losses such as the logistic loss and the hinge loss. This largely limits the practical applications of AUC optimization algorithms, since the least square loss is arguably not the most suitable loss in practice. The very recent work [20] considered AUC maximization with deep neural networks under the least square loss, which results in a nonconvex-concave min-max problem.
In this paper, we develop novel SGD-type algorithms for AUC maximization with a general loss. In particular, we first propose to use Bernstein polynomials [25,26] to uniformly approximate the general loss. Then, we derive its equivalent (non-convex) min-max formulation, which removes the pairwise structure in the original AUC objective function. We show that this non-convex min-max formulation is weakly convex [10,24] in the primal variables and develop novel SGD-type algorithms inspired by the recent work [27]. In contrast to the local convergence proved in [27], we are able to show that our algorithms enjoy global convergence. The key ideas are the introduction of proximal terms on only part of the primal variables in our algorithmic design, and an appealing relation between the original AUC objective function, which is convex due to the convexity-preserving property of Bernstein polynomials, and the duality gap arising from the special structure of the min-max formulation (see more discussions in Section 3).
2. AUC maximization with general loss and min-max reformulation. The AUC score [14] has an elegant probabilistic formulation. Specifically, suppose z = (x, y) and z′ = (x′, y′) are independently drawn from an unknown (sampling) distribution P on Z = X × Y, where Y = {±1}. Then, for a linear scoring function w, the AUC score is the probability of a positive sample ranking higher than a negative sample (e.g., [5,14]), i.e. AUC(w) = Pr(w⊤x ≥ w⊤x′ | y = 1, y′ = −1), where the probability is w.r.t. (z, z′). Hence, maximizing AUC(w) is equivalent to minimizing 1 − AUC(w) = E[𝕀[w⊤x < w⊤x′] | y = 1, y′ = −1]. Since the indicator function is discontinuous, it is replaced by a surrogate loss ℓ, which can be any convex loss such as the hinge loss ℓ(s) = (1 − s)₊ or the logistic loss ℓ(s) = log(1 + e^{−s}). One can find appealing results in [13] on how statistical consistency is related to the choice of different loss functions. Now AUC maximization can be equivalently formulated as (2.2): min_{‖w‖≤R} g(w) := E[ℓ(w⊤x − w⊤x′) 𝕀[y = 1] 𝕀[y′ = −1]], where the constant factor 1/(Pr(y = 1) Pr(y′ = −1)) in the original formulation is ignored. Initial motivation for using Bernstein polynomials. We consider the (stochastic) online setting where individual data points z = (x, y) arrive i.i.d. from the distribution P. The main difficulty in developing AUC optimization algorithms for streaming data is that the population (expected) risk in (2.2) depends on pairs of examples (z, z′), which are statistically dependent since two pairs may share a common individual example. The work [32] showed, for the least square loss, that the original problem (2.2) is equivalent to a convex-concave (saddle) point problem [23] whose new objective function depends on only one individual example at a time.
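Concretely, the probability above can be estimated from a finite sample as the fraction of positive-negative pairs that are ranked correctly. A minimal NumPy sketch (function name is ours, for illustration only):

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.

    Empirical analogue of AUC(w) = Pr(score(x+) > score(x-)).
    """
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]   # all pairwise score gaps
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, -1, -1])
print(empirical_auc(scores, labels))  # 1.0: every positive outranks every negative
```

The double loop over pairs, here materialized as the matrix `diff`, is exactly the pairwise structure that makes a naive SGD update infeasible in the streaming setting.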
Following the same spirit, since the least square loss is a polynomial of degree 2, one would naturally think of approximating the general loss ℓ by higher-order polynomials and then expect an equivalent saddle point reformulation. One plain idea is to use m-th order Taylor polynomials to approximate ℓ; the resulting approximation, however, need not be convex even if ℓ is convex. Instead, we propose to use Bernstein polynomials (e.g., [25,26]), useful tools from approximation theory, to uniformly approximate a convex loss function. Specifically, the Bernstein polynomial of degree m for a function ϕ : [0,1] → R is defined, for any u ∈ [0,1], by B_m(ϕ; u) = Σ_{k=0}^m C(m,k) Δ^k ϕ(0) u^k, where C(m,k) denotes the binomial coefficient and Δ^k ϕ(0) = Σ_{j=0}^k (−1)^{k−j} C(k,j) ϕ(j/m) is the k-th forward difference of ϕ at 0 (with step 1/m). If ϕ is Lipschitz continuous, then B_m(ϕ; ·) converges uniformly to ϕ at a rate of O(1/√m), and the rate improves to O(1/m) if ϕ has a Lipschitz continuous gradient (see a self-contained proof in Part D of the Appendix). More importantly, Bernstein polynomials are convexity-preserving: if ϕ is convex then B_m(ϕ; ·) is convex, which is critical for deriving the global convergence of our proposed algorithm later.
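As a quick numerical illustration (not from the paper; the loss, degree, bound L and evaluation point are arbitrary choices of ours), the Bernstein polynomial can be evaluated via its equivalent basis form B_m(ϕ; u) = Σ_k ϕ(k/m) C(m,k) u^k (1−u)^{m−k}, and both the uniform convergence and the endpoint interpolation B_m(ϕ; 0) = ϕ(0) can be checked directly:

```python
import math

def bernstein(phi, m, u):
    """Degree-m Bernstein polynomial of phi, evaluated at u in [0, 1]."""
    return sum(phi(k / m) * math.comb(m, k) * u**k * (1 - u)**(m - k)
               for k in range(m + 1))

# Illustrative setup: hinge loss on [-L, L] after the change of variables
# u = (L + s) / (2L), i.e. s = L(2u - 1); L = 2 is an arbitrary bound.
L = 2.0
hinge = lambda s: max(0.0, 1.0 - s)
phi = lambda u: hinge(L * (2 * u - 1))

s = 0.5
u = (L + s) / (2 * L)
for m in (5, 20, 80):
    # values approach hinge(0.5) = 0.5 as the degree m grows
    print(m, bernstein(phi, m, u))
```

Since the hinge loss is Lipschitz but not differentiable at the kink, the observed convergence is consistent with the O(1/√m) rate stated above.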
To approximate the loss on a general interval, we assume D = sup_{x∈X} ‖x‖ < ∞, and thus s := w⊤x − w⊤x′ satisfies |s| ≤ 2RD := L for any x, x′ and any w with ‖w‖ ≤ R. Now, by the change of variables u = (L+s)/(2L) (i.e. s = L(2u − 1)), the loss ℓ induces a function on the unit interval [0,1] by letting ϕ(u) = ℓ(s) for any s ∈ [−L, L]. Consequently, replacing ℓ in (2.2) by the Bernstein approximation B_m(ϕ; ·) yields the approximate objective (2.5), which is convex due to the convexity-preserving property of Bernstein polynomials [25,26].
2.1. Min-max formulation. Here, we show that the AUC maximization problem (2.5) is equivalent to a (non-convex) min-max formulation, and discuss some properties of its objective function. For simplicity of notation, denote the component functions e⁺ and e⁻; in addition, we define the functions φ and F, where the expectation E_z[·] is taken w.r.t. z = (x, y).
Proof. Notice that, since (x, y) and (x′, y′) are independent, the expectation terms in (2.5) can be written w.r.t. a single example. Thus, the objective function in (2.5) can be rewritten accordingly. It follows that, for every w, the identity (2.8) holds and the optima are achieved at the values in (2.9). This completes the proof of the theorem.
Properties of the min-max formulation. We discuss useful properties of the min-max formulation and the function F. Firstly, we can show that u = (v, α) = (w, a, b, α) in formulation (2.7) can be restricted to a bounded domain. To see this, notice from (2.8) and (2.9) that any optimal point (v*, α*) satisfies a* = a(w*). Therefore, by the definitions of E⁺ and E⁻, and noting that |w⊤x| ≤ ‖w‖ ‖x‖ ≤ RD = L/2, the variables (w, a, b, α) in formulation (2.7) can, without loss of generality, be restricted to a bounded set. Secondly, it is easy to see that the function F(v, α; z) is not convex with respect to v = (w, a, b) but is strongly concave with respect to α. Hence the min-max formulation (2.7) is not a standard convex-concave saddle point problem [23]. However, one can show that F is ρ-weakly convex in v for some ρ > 0, i.e. F(v, α; z) + (ρ/2)‖v‖² is convex in v for any α and z [24,8] (see the proof in Part A of the Appendix). Perhaps most importantly, one can further show that adding a partial regularization term involving ‖w‖² to F(v, α; z), instead of the squared norm ‖v‖² of all primal variables, has the same convexity-inducing effect, i.e. F(v, α; z) + (γ/2)‖w‖² is convex w.r.t. v for a sufficiently large γ > 0. To show this, let us introduce some notation

ZHENHUAN YANG, WEI SHEN, YIMING YING AND XIAOMING YUAN
and consider the proposition below. As we will see in the next section, this proposition plays a key role in designing an SGD-type algorithm that enjoys global convergence (see the proof in Part B of the Appendix).
3. Algorithm and convergence analysis. In this section, we propose a stochastic optimization algorithm for our novel min-max formulation (2.7). The proposed algorithm, called SAUC, is described in Algorithm 1 and is inspired by the recent work [27]. In particular, the appealing work [27] studied a family of non-convex min-max problems where the minimization component is weakly convex and the maximization component is concave, motivated by the studies on weakly convex minimization problems [8,7].

[Algorithm 1 (SAUC): outer loop over t; inner loop "for j = 1 to t" performing projected primal-dual SGD steps at lines 5-8.]

Our algorithm SAUC has cheap storage and per-iteration cost. Indeed, at line 6 the samples {z_{t,j} = (x_{t,j}, y_{t,j})} are i.i.d. from the distribution P on X × Y. As such, one only needs to store the current sample with space O(d), which is linear in the data dimension. This makes SAUC the first truly online algorithm for AUC maximization with a general loss. The main per-iteration cost comes from lines 6 and 7, which perform the standard SGD-type update of [23] for solving the standard (convex-concave) stochastic saddle point problem (3.1). Note that the projections at line 6 can be easily computed since Ω₁ and Ω₂ are bounded ℓ₂-balls. The computation of ∇_v Φ^t_γ (involving the computation of ∇f_i(w; x), etc.) only requires computing and saving inner products, with a cost in which the factor m can be ignored for large d.
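For illustration, the projection onto a bounded ℓ₂-ball has a simple closed form, and one inner-loop step is a projected gradient descent-ascent update. Below is a hedged sketch with hypothetical names and shapes; it shows the generic update pattern, not the exact gradients of Φ^t_γ:

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {v : ||v||_2 <= radius} (closed form)."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def saddle_sgd_step(v, alpha, grad_v, grad_alpha, eta, r1, r2):
    """One projected descent step on the primal v and ascent step on the dual alpha."""
    v_next = project_l2_ball(v - eta * grad_v, r1)           # descent, then project
    alpha_next = project_l2_ball(alpha + eta * grad_alpha, r2)  # ascent, then project
    return v_next, alpha_next

print(project_l2_ball(np.array([3.0, 4.0]), 1.0))  # [0.6 0.8]
```

The projection costs O(d), so each inner iteration stays linear in the data dimension, matching the per-iteration complexity discussed above.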
Compared with the work [27], the main difference of our algorithm is that we apply the proximal term only to the w-component of the primal variables. This simple design is important for proving the global convergence of Algorithm 1. To be more specific, we can interpret the result in [27] under our setting of AUC maximization as follows. The work [27] transformed a constrained optimization problem, namely min_{v∈Ω₁} ψ(v), into an unconstrained, however non-smooth, optimization problem. Then, for τ chosen uniformly from {0, ..., T−1} and appropriately chosen stepsizes, a near-stationarity guarantee was obtained. Note that this only implies local convergence of the iterate v̄_τ, and it is difficult to derive global convergence from it. The global convergence result for Algorithm 1 is stated as follows. Let w* := argmin_{‖w‖≤R} f(w) be an optimal solution of the original AUC maximization problem (2.5).
The proof of Theorem 3.1 requires the following lemma, which shows the convergence of the SGD-type algorithm for solving the saddle point problem (3.1) (the inner loop from line 5 to line 8 in Algorithm 1, with t fixed in the outer loop). Recall that ϕ^t_γ is defined by (3.1), and define the duality gap ε(ū_t) at ū_t accordingly; in the bound below, C₁ is an absolute constant independent of t (see its explicit expression in the proof).
The proof of Lemma 3.2 is standard [23]; a self-contained proof can be found in Part C of the Appendix. To prove Theorem 3.1, we further define, for any w̃_t, the proximal point ŵ_t = argmin_{‖w‖≤R} {f(w) + (γ/2)‖w − w̃_t‖²}. Now we are ready to prove Theorem 3.1.
Proof of Theorem 3.1. We first estimate f(ŵ_t) − f(w*). For this purpose, notice that the convexity of f(w) yields one inequality while the optimality condition of ŵ_t yields another; combining the two gives the stated bound. By convexity and the Cauchy-Schwarz inequality, the above estimate implies the next bound, wherein the first inequality follows from the optimality condition of ŵ_{t−1} and the γ-strong convexity of f(w) + (γ/2)‖w − w̃_{t−1}‖² for ‖w‖ ≤ R, while the second inequality can be derived as follows. Indeed, from (2.8) in the proof of Theorem 2.1 we know that f(ŵ_t) = min_{a,b} max_α φ(ŵ_t, a, b, α), which implies the first estimate; by (2.8) again we obtain the second. To further bound (3.5), we need some elementary inequalities; combining them with (3.5) implies the displayed bound, where the last inequality uses the fact that (γ/2)‖ŵ_{t−1} − w̃_t‖² ≤ ε(ū_t) from (3.5). We also notice, by the definition of ŵ_{t−1}, a further bound. Combining the above two estimates, summing the inequalities from t = 1 to t = T, combining with (3.4), and taking expectations on both sides, we obtain the claim. Combining it with Lemma 3.2 completes the proof with C_m = 30(L₁ + 2Rγ)²(L₁R + C₁)γ⁻¹.
Trading off the approximation error of Bernstein polynomials against the bound in Theorem 3.1 yields the following final convergence rate for the original objective function g of AUC maximization defined by (2.2).
where C and B are constants depending on G, R, D and β but independent of m and T .
The proof of Theorem 3.3 is given in Part D of the Supplementary Material. From the above theorem, we can choose m = log_C(T^{1/4}/log T) = O(log T), which yields a final convergence rate of O(1/log T). However, this analysis and convergence rate are not satisfactory. It remains a challenging open problem to derive a more desirable final convergence rate.

4. Experiments. In this section, we compare our proposed algorithm SAUC against existing AUC optimization algorithms. In particular, SAUC-H and SAUC-L denote SAUC with the hinge loss and the logistic loss, respectively. All experiments were implemented in Python 3 on a machine with 16 × 3.0GHz CPUs and 128GB memory.
We conducted our experiments on 9 benchmark datasets downloaded from LIBSVM [4] and the UCI machine learning repository [9]. Multi-class datasets have been converted into binary-class ones by randomly partitioning the classes into positive and negative groups. All data have been normalized to unit ℓ₂-norm. Information about these datasets is summarized in Table 1.
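The normalization step can be written, for instance, as follows (a small NumPy sketch with a hypothetical function name):

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    """Scale each example (row) of X to unit l2-norm; eps guards zero rows."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)
```

After this preprocessing, every feature vector satisfies ‖x‖ = 1, so the bound D = sup_x ‖x‖ used in Section 2 equals 1.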

Table 1. Statistics of datasets
Generalization performance: We compare SAUC with the following state-of-the-art online learning algorithms for AUC optimization: 1) Online AUC Maximization (OAM) [36], with focus on OAM_gra associated with the hinge loss; the buffer sizes N⁺, N⁻ for the positive and negative classes are set to 100 as in the original paper; 2) One-Pass AUC Maximization (OPAUC) [12], which optimizes the square loss with an ℓ₂ regularization term; 3) Stochastic Online AUC Maximization (SOLAM) [32], which optimizes the square loss on a bounded domain; 4) Stochastic Proximal Algorithm for AUC Maximization (SPAM) [22], which optimizes the square loss with an ℓ₂ regularizer; 5) Fast Stochastic AUC Maximization (FSAUC) [21], which optimizes the square loss on a bounded ℓ₁ domain; the probability parameter δ is set to 0.1 as in the code provided by the authors. For a fair comparison, all algorithms with an ℓ₂ regularizer are converted to use an ℓ₂-norm constraint.
In the training phase of each algorithm, we use 5-fold cross validation to determine the bound radius R ∈ 10^[−2:2] and the learning rate parameter β ∈ 10^[−2:2]. The proximal parameter γ is chosen as γ₀ as given in (2.10). Throughout our experiments, the degree m of the Bernstein polynomials is chosen to be 10. The performance of each algorithm is evaluated by averaging the results of 10 epochs of 5-fold cross validation.
Testing performances of all methods are summarized in Table 2. These results show that SAUC achieves similar or competitive performance compared with other state-of-the-art online or stochastic methods for AUC maximization. In particular, SAUC-H performs similarly to OAM_gra on all datasets. In addition, on the datasets australian, leu, sector, skin_nonskin and svmguide1, the hinge loss based algorithm SAUC-H outperforms the square loss based algorithms. This suggests that losses other than the square loss (e.g., the hinge or logistic loss) may be more suitable for these datasets.
Convergence speed: We compare convergence versus CPU time (in seconds) of SAUC-H and OAM_gra on the datasets dna, news20 and sector. We choose these two algorithms because both maximize AUC with the hinge loss. The results are summarized in Figure 1. They show that SAUC-H converges much faster than OAM_gra when the feature dimension d is high. One important reason may be that SAUC only needs O(d) memory and per-iteration cost, while OAM needs O(Bd), where B is the buffer size.

Table 2. Comparison of AUC score (mean±std) on test data; OPAUC on news20 and sector does not converge within a reasonable time limit. The best AUC value on each dataset is in bold and the second best is underlined.

Sensitivity of the degree of the Bernstein polynomial: Here we investigate the sensitivity of the empirical performance to the degree m of the Bernstein polynomials. Figure 2 reports the AUC scores of SAUC-H with varying Bernstein polynomial degrees m on the datasets dna, news20 and sector.
First of all, we find that on sector, when the Bernstein polynomial degree is too small (e.g. m = 2 or m = 5), SAUC achieves the lowest AUC scores. This matches the intuition that low-degree Bernstein polynomials approximate the original loss poorly. Interestingly, we also find that on dna and news20, when m = 2 the approximated loss becomes a square loss and the performance improves, which coincides with the results in Table 2. Finally, we find that when m is large enough, the AUC scores tend to saturate. This is consistent with the theoretically optimal choice m = O(log T) from Theorem 3.3: a larger degree does not necessarily lead to better performance.

5. Conclusion. In this paper, we proposed to use Bernstein polynomials to uniformly approximate a general convex loss, and then showed that AUC maximization is equivalent to a (non-convex) weakly convex saddle point (min-max) problem. From this equivalent formulation, we proposed a novel SGD-type AUC optimization algorithm for streaming data analysis. Although the min-max formulation is non-convex, we showed that the proposed algorithm still enjoys global convergence by exploiting the intrinsic structure of the min-max formulation and the convexity-preserving property of Bernstein polynomials. Finally, we performed experiments to validate the effectiveness of the proposed algorithm.
There are several directions for future research. Firstly, the decomposition of f(w) into φ(v, α) is not unique. It would be interesting to find other possible decompositions and investigate their theoretical differences and the resulting algorithmic performance in practice. Secondly, the choice of the parameter γ₀ in (2.10) for SAUC is potentially large if m or R is large. It would be interesting to choose γ₀ adaptively using line search strategies. Another closely related question is that the final convergence rate for the original objective function (2.2) of AUC maximization is not desirable, as the estimation of the constant C_m in terms of γ₀ and R is complicated. It remains unclear to us how to derive a fast final convergence rate.
Acknowledgement. The authors would like to thank the reviewers for their constructive comments and suggestions.

Then we write the Hessian matrix of Φ_γ(v, α; z) w.r.t. v as a block matrix, where I_d is the d × d identity matrix and I_{m+1} is the (m+1) × (m+1) identity matrix. For simplicity, we study the two cases y = 1 and y = −1 separately. On one hand, if y = 1, we only need H⁺ ⪰ 0. Next we estimate a lower bound on the minimal eigenvalue of H⁺. Firstly, we know that ∇²_w F has only one nonzero eigenvalue. Secondly, we write (∇_w e⁺)(∇_w e⁺)⊤ = Σ_{i=0}^m (∇_w e⁺_i)(∇_w e⁺_i)⊤, and each (∇_w e⁺_i)(∇_w e⁺_i)⊤ has only one nonzero eigenvalue, namely ‖∇_w e⁺_i‖². Thus the maximal eigenvalue of (∇_w e⁺)(∇_w e⁺)⊤ is upper bounded by Σ_{i=0}^m ‖∇_w e⁺_i‖² = ‖∇_w e⁺‖² ≤ D²(S₁⁺)². Finally, the minimal eigenvalue of H⁺ is lower bounded accordingly, which means that if y = 1 we can choose γ as stated. Similarly, if y = −1 we can choose γ analogously; as the deduction is almost the same as in the previous case, we omit the details here. In conclusion, convexity holds if the parameter γ ≥ γ₀. C. Proof of Lemma 3.2. The proof of Lemma 3.2 needs the following elementary lemma.
Proof. By the optimality condition of ω_{j+1}, for any ω ∈ Ω, we have (C.1). We decompose the left-hand side of (C.1) accordingly. To further decompose the right-hand side of (C.1), we expand the squared distance ‖ω − ω_j‖². Thus we obtain the stated bound, where the last inequality uses the fact that (ω_j − ω_{j+1})⊤g ≤ (η/2)‖g‖² + (1/2η)‖ω_j − ω_{j+1}‖². This proves the lemma.
We are ready to prove Lemma 3.2.
Proof of Lemma 3.2. By the convexity of ϕ^t_γ(·, α), we have a first inequality for all v ∈ Ω₁; similarly, by the concavity of ϕ^t_γ(v, ·), we have a second inequality for all α ∈ Ω₂. With this notation, the (j+1)-th update step in the t-th inner loop of Algorithm 1 can be written in projected-gradient form, and we define an auxiliary sequence similarly. Adding up (C.3), (C.4) and (C.5), and again using the convexity and concavity of ϕ^t_γ, we obtain a bound which, combined with (C.6), implies that (C.7) holds for all u ∈ Ω₁ × Ω₂. Now taking the supremum over u ∈ Ω₁ × Ω₂ on both sides of (C.7) yields the next estimate, where the last inequality uses the bounds on v and α. We next show how to uniformly bound ‖G^t_{j+1}‖² and ‖∆^t_{j+1}‖². For simplicity of notation, we write G^{(t,j+1)}_v, G^{(t,j+1)}_α, g^{(t,j+1)}_v and g^{(t,j+1)}_α as G_v, G_α, g_v and g_α, respectively. Combining these with the stated facts gives the bound on G_v; moreover, a similar bound holds for G_α. Then, replacing e⁺ and e⁻ with E⁺ and E⁻ respectively, and using almost the same deduction as above, we bound g_v and g_α. Taking expectation on both sides of (C.9), we conclude the desired bound. Then, choosing η_t = β/√t and denoting C₁ := 4β⁻¹[R² + R₁² + R₂² + (R₁ + R₂)²] + 5β(M₁² + M₂²)/2, we obtain the desired result.
D. Final convergence rate. In this subsection, we investigate the final convergence of the output w_T of SAUC on the original objective function g of AUC maximization given by (2.2), through the theoretically optimal choice of the degree m of the Bernstein polynomials. The proof of Theorem 3.3 requires the following proposition on the approximation error of Bernstein polynomials [25,26]. To evaluate C_m, we will need the following quantities. Firstly, the k-th order forward difference is upper bounded using the fact that (a + b)^k = Σ_{j=0}^k C(k,j) a^j b^{k−j}. Secondly, the bounds on a and b can be derived as follows.