# American Institute of Mathematical Sciences

• Previous Article
Approximation by multivariate max-product Kantorovich-type operators and learning rates of least-squares regularized regression
• CPAA Home
• This Issue
• Next Article
A numerical method to compute Fisher information for a special case of heterogeneous negative binomial regression
August  2020, 19(8): 4191-4212. doi: 10.3934/cpaa.2020188

## Stochastic AUC optimization with general loss

 1 Department of Mathematics and Statistics, State University of New York at Albany, Albany, NY 12206, USA 2 Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, China 3 Department of Mathematics, The University of Hong Kong, Hong Kong, China

* Corresponding author

Received  October 2019 Revised  January 2020 Published  May 2020

Fund Project: This work was completed when Wei Shen was a visiting student at SUNY Albany. Yiming Ying is supported by the National Science Foundation (NSF, Grant IIS1816227)

Recently, there is considerable work on developing efficient stochastic optimization algorithms for AUC maximization. However, most of them focus on the least square loss which may be not the best option in practice. The main difficulty for dealing with the general convex loss is the pairwise nonlinearity w.r.t. the sampling distribution generating the data. In this paper, we use Bernstein polynomials to uniformly approximate the general losses which are able to decouple the pairwise nonlinearity. In particular, we show that this reduction for AUC maximization with a general loss is equivalent to a weakly convex (nonconvex) min-max formulation. Then, we develop a novel SGD algorithm for AUC maximization with per-iteration cost linearly w.r.t. the data dimension, making it amenable for streaming data analysis. Despite its non-convexity, we prove its global convergence by exploring the appealing convexity-preserving property of Bernstein polynomials and the intrinsic structure of the min-max formulation. Experiments are performed to validate the effectiveness of the proposed approach.

Citation: Zhenhuan Yang, Wei Shen, Yiming Ying, Xiaoming Yuan. Stochastic AUC optimization with general loss. Communications on Pure & Applied Analysis, 2020, 19 (8) : 4191-4212. doi: 10.3934/cpaa.2020188
##### References:
 [1] F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O (1/n), in Advances in Neural Information Processing Systems, (2013), 773–781. Google Scholar [2] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit., 30 (1997), 1145-1159.  doi: 10.1016/S0031-3203(96)00142-2.  Google Scholar [3] T. Calders and S. Jaroszewicz, Efficient AUC optimization for classification, in PKDD, Vol. 4702, Springer, (2007), 42–53. Google Scholar [4] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 21 pp. doi: 10.1145/1961189.1961199.  Google Scholar [5] S. Clémençon, G. Lugosi and N. Vayatis, Ranking and empirical minimization of U-statistics, Ann. Statist., 36 (2008), 844-874.  doi: 10.1214/009052607000000910.  Google Scholar [6] C. Cortes and M. Mohri, AUCoptimization vs. error rate minimization, in Advances in Neural Information Processing Systems, (2004), 313–320. Google Scholar [7] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), 207-239.  doi: 10.1137/18M1178244.  Google Scholar [8] D. Davis and B. Grimmer, Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems, SIAM J. Optim., 29 (2019), 1908-1930.  doi: 10.1137/17M1151031.  Google Scholar [9] Dheeru, Dua and Karra Taniskidou, Efi, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. Available from: http://archive.ics.uci.edu/ml. Google Scholar [10] D. Drusvyatskiy, The proximal point method revisited, preprint, arXiv: 1712.06038. Google Scholar [11] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett., 27 (2006), 861-874.   Google Scholar [12] W. Gao, R. Jin, S. Zhu and Z. H. Zhou, One-pass AUC optimization, in International Conference on Machine Learning, (2013), 906–914. Google Scholar [13] W. Gao and Z. H. Zhou, On the Consistency of AUC Pairwise Optimization, in IJCAI, (2015), 939–945. Google Scholar [14] J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143 (1982), 29-36.   Google Scholar [15] A. Herschtal and B. Raskutti, Optimising area under the ROC curve using gradient descent, in Proceedings of the 21st International Conference on Machine Learning, ACM, (2004), 49. Google Scholar [16] T. Joachims, A support vector method for multivariate performance measures, in Proceedings of the 22nd International Conference on Machine Learning, ACM, (2005), 377–384. Google Scholar [17] P. Kar, B. Sriperumbudur, P. Jain and H. Karnick, On the generalization ability of online learning algorithms for pairwise loss functions, in International Conference on Machine Learning, (2013), 441–449. Google Scholar [18] S. Lacoste-Julien, M. Schmidt and F. Bach, A simpler approach to obtaining an O (1/t) convergence rate for the projected stochastic subgradient method, preprint, arXiv: 1212.2002. doi: 10.1137/1.9781611974331.ch127.  Google Scholar [19] J. Lin and L. Rosasco, Optimal learning for multi-pass stochastic gradient methods, in Advances in Neural Information Processing Systems, (2016), 4556–4564.  Google Scholar [20] M. Liu, Z. Yuan, Y. Ying and T. Yang, Stochastic AUC Maximization with Deep Neural Networks, in International Conference on Learning Representations (ICLR), 2020. Google Scholar [21] M. Liu, X. Zhang, Z. Chen, X. Wang and T. Yang, Fast stochastic AUC maximization with O (1/n)-convergence rate, in International Conference on Machine Learning, (2018), 3195–3203. Google Scholar [22] M. Natole, Y. Ying and S. Lyu, Stochastic proximal algorithms for AUC maximization, in International Conference on Machine Learning, (2018), 3707–3716. Google Scholar [23] A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), 1574-1609.  doi: 10.1137/070704277.  Google Scholar [24] E. A. Nurminskii, The quasigradient method for the solving of the nonlinear programming problems, Cybernetics, 9 (1973), 145-150.   Google Scholar [25] G. M. Phillips, Interpolation and Approximation by Polynomials, Vol. 14, Springer Science & Business Media, 2003. doi: 10.1007/b97417.  Google Scholar [26] M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, 1981.   Google Scholar [27] H. Rafique, M. Liu, Q. Lin and T. Yang, Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning, preprint, arXiv: 1810.02060. Google Scholar [28] A. Rakhlin, O. Shamir and K. Sridharan, Making gradient descent optimal for strongly convex stochastic optimization, in Proceedings of the 29th International Conference on Machine Learning, (2012), 449–456. Google Scholar [29] O. Shamir and T. Zhang, Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, in International Conference on Machine Learning, (2013), 71–79. Google Scholar [30] N. Srebro, A. Tewari, Stochastic optimization for machine learning, CML Tutorial, (2010). Google Scholar [31] Y. Wang, R. Khardon, D. Pechyony and R. Jones, Generalization bounds for online learning algorithms with pairwise loss functions, in Conference on Learning Theory, (2012), 13. Google Scholar [32] Y. Ying, L. Wen and S. Lyu, Stochastic online AUC maximization, in Advances in Neural Information Processing Systems, 2016. Google Scholar [33] Y. Ying and D. X. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775-4788.  doi: 10.1109/TIT.2006.883632.  Google Scholar [34] Y. Ying and D. X. Zhou, Online pairwise learning algorithms, Neural Comput., 28 (2016), 743-777.  doi: 10.1162/neco_a_00817.  Google Scholar [35] X. Zhang, A. Saha and S. V. N. Vishwanathan, Smoothing multivariate performance measures, J. Mach. Learn. Res., 13 (2012), 3623-3680.   Google Scholar [36] P. Zhao, R. Jin, T. Yang and S. C. Hoi, Online AUC maximization, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011. Google Scholar

show all references

##### References:
 [1] F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O (1/n), in Advances in Neural Information Processing Systems, (2013), 773–781. Google Scholar [2] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit., 30 (1997), 1145-1159.  doi: 10.1016/S0031-3203(96)00142-2.  Google Scholar [3] T. Calders and S. Jaroszewicz, Efficient AUC optimization for classification, in PKDD, Vol. 4702, Springer, (2007), 42–53. Google Scholar [4] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 21 pp. doi: 10.1145/1961189.1961199.  Google Scholar [5] S. Clémençon, G. Lugosi and N. Vayatis, Ranking and empirical minimization of U-statistics, Ann. Statist., 36 (2008), 844-874.  doi: 10.1214/009052607000000910.  Google Scholar [6] C. Cortes and M. Mohri, AUCoptimization vs. error rate minimization, in Advances in Neural Information Processing Systems, (2004), 313–320. Google Scholar [7] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), 207-239.  doi: 10.1137/18M1178244.  Google Scholar [8] D. Davis and B. Grimmer, Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems, SIAM J. Optim., 29 (2019), 1908-1930.  doi: 10.1137/17M1151031.  Google Scholar [9] Dheeru, Dua and Karra Taniskidou, Efi, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. Available from: http://archive.ics.uci.edu/ml. Google Scholar [10] D. Drusvyatskiy, The proximal point method revisited, preprint, arXiv: 1712.06038. Google Scholar [11] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett., 27 (2006), 861-874.   Google Scholar [12] W. Gao, R. Jin, S. Zhu and Z. H. Zhou, One-pass AUC optimization, in International Conference on Machine Learning, (2013), 906–914. Google Scholar [13] W. Gao and Z. H. Zhou, On the Consistency of AUC Pairwise Optimization, in IJCAI, (2015), 939–945. Google Scholar [14] J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143 (1982), 29-36.   Google Scholar [15] A. Herschtal and B. Raskutti, Optimising area under the ROC curve using gradient descent, in Proceedings of the 21st International Conference on Machine Learning, ACM, (2004), 49. Google Scholar [16] T. Joachims, A support vector method for multivariate performance measures, in Proceedings of the 22nd International Conference on Machine Learning, ACM, (2005), 377–384. Google Scholar [17] P. Kar, B. Sriperumbudur, P. Jain and H. Karnick, On the generalization ability of online learning algorithms for pairwise loss functions, in International Conference on Machine Learning, (2013), 441–449. Google Scholar [18] S. Lacoste-Julien, M. Schmidt and F. Bach, A simpler approach to obtaining an O (1/t) convergence rate for the projected stochastic subgradient method, preprint, arXiv: 1212.2002. doi: 10.1137/1.9781611974331.ch127.  Google Scholar [19] J. Lin and L. Rosasco, Optimal learning for multi-pass stochastic gradient methods, in Advances in Neural Information Processing Systems, (2016), 4556–4564.  Google Scholar [20] M. Liu, Z. Yuan, Y. Ying and T. Yang, Stochastic AUC Maximization with Deep Neural Networks, in International Conference on Learning Representations (ICLR), 2020. Google Scholar [21] M. Liu, X. Zhang, Z. Chen, X. Wang and T. Yang, Fast stochastic AUC maximization with O (1/n)-convergence rate, in International Conference on Machine Learning, (2018), 3195–3203. Google Scholar [22] M. Natole, Y. Ying and S. Lyu, Stochastic proximal algorithms for AUC maximization, in International Conference on Machine Learning, (2018), 3707–3716. Google Scholar [23] A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), 1574-1609.  doi: 10.1137/070704277.  Google Scholar [24] E. A. Nurminskii, The quasigradient method for the solving of the nonlinear programming problems, Cybernetics, 9 (1973), 145-150.   Google Scholar [25] G. M. Phillips, Interpolation and Approximation by Polynomials, Vol. 14, Springer Science & Business Media, 2003. doi: 10.1007/b97417.  Google Scholar [26] M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, 1981.   Google Scholar [27] H. Rafique, M. Liu, Q. Lin and T. Yang, Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning, preprint, arXiv: 1810.02060. Google Scholar [28] A. Rakhlin, O. Shamir and K. Sridharan, Making gradient descent optimal for strongly convex stochastic optimization, in Proceedings of the 29th International Conference on Machine Learning, (2012), 449–456. Google Scholar [29] O. Shamir and T. Zhang, Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, in International Conference on Machine Learning, (2013), 71–79. Google Scholar [30] N. Srebro, A. Tewari, Stochastic optimization for machine learning, CML Tutorial, (2010). Google Scholar [31] Y. Wang, R. Khardon, D. Pechyony and R. Jones, Generalization bounds for online learning algorithms with pairwise loss functions, in Conference on Learning Theory, (2012), 13. Google Scholar [32] Y. Ying, L. Wen and S. Lyu, Stochastic online AUC maximization, in Advances in Neural Information Processing Systems, 2016. Google Scholar [33] Y. Ying and D. X. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775-4788.  doi: 10.1109/TIT.2006.883632.  Google Scholar [34] Y. Ying and D. X. Zhou, Online pairwise learning algorithms, Neural Comput., 28 (2016), 743-777.  doi: 10.1162/neco_a_00817.  Google Scholar [35] X. Zhang, A. Saha and S. V. N. Vishwanathan, Smoothing multivariate performance measures, J. Mach. Learn. Res., 13 (2012), 3623-3680.   Google Scholar [36] P. Zhao, R. Jin, T. Yang and S. C. Hoi, Online AUC maximization, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011. Google Scholar
Comparison of convergence speed between SAUC-H and $\text{OAM}_{gra}$
Evaluation of AUC scores vesus the degree of the Bernstein polynomial
 Algorithm 1: Stochastic AUC Optimization (SAUC) 1: Input: $R>0$, $\gamma\geq\gamma_0$ and $\beta>0$. 2: Initialize $\bar{{\mathbf{v}}}_0 = 0$ and $\bar{{\mathit{\boldsymbol{\alpha}}}}_0 = 0$. 3: for $t=1$ to $T-1$ do4: Set ${\mathbf{v}}_0^t = \bar{{\mathbf{v}}}_{t-1}, {\mathit{\boldsymbol{\alpha}}}_0^t = \bar{{\mathit{\boldsymbol{\alpha}}}}_{t-1}$ and $\eta_t = \frac{\beta}{\sqrt{t}}.$ 5: for $j=1$ to $t$ do 6: Randomly sample $z_j^t = (x_j^t,y_j^t)$ and compute \begin{align*} &{\mathbf{v}}_{j}^t = {{\bf Proj}}_{{\Omega}_1} \bigl({\mathbf{v}}_{j-1}^t - \eta_t \nabla_{{\mathbf{v}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr), &{\mathit{\boldsymbol{\alpha}}}_{j}^t = {{\bf Proj}}_{{\Omega}_2} \bigl({\mathit{\boldsymbol{\alpha}}}_{j-1}^t + \eta_t \nabla_{{\mathit{\boldsymbol{\alpha}}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr) \end{align*} 7: end for 8: Compute $\bar{{\mathbf{v}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathbf{v}}_j^t$ and $\bar{{\mathit{\boldsymbol{\alpha}}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathit{\boldsymbol{\alpha}}}_j^t.$9: end for 10: Output: $\widetilde{{\mathbf{v}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathbf{v}}}_{t}$ and $\widetilde{{\mathit{\boldsymbol{\alpha}}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathit{\boldsymbol{\alpha}}}}_{t}.$
 Algorithm 1: Stochastic AUC Optimization (SAUC) 1: Input: $R>0$, $\gamma\geq\gamma_0$ and $\beta>0$. 2: Initialize $\bar{{\mathbf{v}}}_0 = 0$ and $\bar{{\mathit{\boldsymbol{\alpha}}}}_0 = 0$. 3: for $t=1$ to $T-1$ do4: Set ${\mathbf{v}}_0^t = \bar{{\mathbf{v}}}_{t-1}, {\mathit{\boldsymbol{\alpha}}}_0^t = \bar{{\mathit{\boldsymbol{\alpha}}}}_{t-1}$ and $\eta_t = \frac{\beta}{\sqrt{t}}.$ 5: for $j=1$ to $t$ do 6: Randomly sample $z_j^t = (x_j^t,y_j^t)$ and compute \begin{align*} &{\mathbf{v}}_{j}^t = {{\bf Proj}}_{{\Omega}_1} \bigl({\mathbf{v}}_{j-1}^t - \eta_t \nabla_{{\mathbf{v}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr), &{\mathit{\boldsymbol{\alpha}}}_{j}^t = {{\bf Proj}}_{{\Omega}_2} \bigl({\mathit{\boldsymbol{\alpha}}}_{j-1}^t + \eta_t \nabla_{{\mathit{\boldsymbol{\alpha}}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr) \end{align*} 7: end for 8: Compute $\bar{{\mathbf{v}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathbf{v}}_j^t$ and $\bar{{\mathit{\boldsymbol{\alpha}}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathit{\boldsymbol{\alpha}}}_j^t.$9: end for 10: Output: $\widetilde{{\mathbf{v}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathbf{v}}}_{t}$ and $\widetilde{{\mathit{\boldsymbol{\alpha}}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathit{\boldsymbol{\alpha}}}}_{t}.$
Statistics of datasets
Comparison of AUC score (mean$\pm$std) on test data; OPAUC on news20 and sector does not converge in a reasonable time limit. Best AUC value on each dataset is in bold and second is underlined
 [1] Predrag S. Stanimirović, Branislav Ivanov, Haifeng Ma, Dijana Mosić. A survey of gradient methods for solving nonlinear optimization. Electronic Research Archive, 2020, 28 (4) : 1573-1624. doi: 10.3934/era.2020115 [2] Alberto Bressan, Sondre Tesdal Galtung. A 2-dimensional shape optimization problem for tree branches. Networks & Heterogeneous Media, 2020  doi: 10.3934/nhm.2020031 [3] Aihua Fan, Jörg Schmeling, Weixiao Shen. $L^\infty$-estimation of generalized Thue-Morse trigonometric polynomials and ergodic maximization. Discrete & Continuous Dynamical Systems - A, 2021, 41 (1) : 297-327. doi: 10.3934/dcds.2020363 [4] Zhenzhen Wang, Tianshou Zhou. Asymptotic behaviors and stochastic traveling waves in stochastic Fisher-KPP equations. Discrete & Continuous Dynamical Systems - B, 2020  doi: 10.3934/dcdsb.2020323 [5] Lorenzo Zambotti. A brief and personal history of stochastic partial differential equations. Discrete & Continuous Dynamical Systems - A, 2021, 41 (1) : 471-487. doi: 10.3934/dcds.2020264 [6] Siyang Cai, Yongmei Cai, Xuerong Mao. A stochastic differential equation SIS epidemic model with regime switching. Discrete & Continuous Dynamical Systems - B, 2020  doi: 10.3934/dcdsb.2020317 [7] Xuhui Peng, Rangrang Zhang. Approximations of stochastic 3D tamed Navier-Stokes equations. Communications on Pure & Applied Analysis, 2020, 19 (12) : 5337-5365. doi: 10.3934/cpaa.2020241 [8] Yahia Zare Mehrjerdi. A new methodology for solving bi-criterion fractional stochastic programming. Numerical Algebra, Control & Optimization, 2020  doi: 10.3934/naco.2020054 [9] Yueyang Zheng, Jingtao Shi. A stackelberg game of backward stochastic differential equations with partial information. Mathematical Control & Related Fields, 2020  doi: 10.3934/mcrf.2020047 [10] Reza Chaharpashlou, Abdon Atangana, Reza Saadati. On the fuzzy stability results for fractional stochastic Volterra integral equation. Discrete & Continuous Dynamical Systems - S, 2020  doi: 10.3934/dcdss.2020432 [11] Christian Beck, Lukas Gonon, Martin Hutzenthaler, Arnulf Jentzen. On existence and uniqueness properties for solutions of stochastic fixed point equations. Discrete & Continuous Dynamical Systems - B, 2020  doi: 10.3934/dcdsb.2020320 [12] Giuseppina Guatteri, Federica Masiero. Stochastic maximum principle for problems with delay with dependence on the past through general measures. Mathematical Control & Related Fields, 2020  doi: 10.3934/mcrf.2020048 [13] Yangrong Li, Shuang Yang, Qiangheng Zhang. Odd random attractors for stochastic non-autonomous Kuramoto-Sivashinsky equations without dissipation. Electronic Research Archive, 2020, 28 (4) : 1529-1544. doi: 10.3934/era.2020080 [14] Pengyu Chen. Non-autonomous stochastic evolution equations with nonlinear noise and nonlocal conditions governed by noncompact evolution families. Discrete & Continuous Dynamical Systems - A, 2020  doi: 10.3934/dcds.2020383 [15] Lin Shi, Xuemin Wang, Dingshi Li. Limiting behavior of non-autonomous stochastic reaction-diffusion equations with colored noise on unbounded thin domains. Communications on Pure & Applied Analysis, 2020, 19 (12) : 5367-5386. doi: 10.3934/cpaa.2020242

2019 Impact Factor: 1.105

## Metrics

• PDF downloads (140)
• HTML views (80)
• Cited by (0)

## Other articlesby authors

• on AIMS
• on Google Scholar

[Back to Top]