\`x^2+y_1+z_12^34\`
Advanced Search
Article Contents
Article Contents

Stochastic AUC optimization with general loss

  • * Corresponding author

    * Corresponding author 
This work was completed when Wei Shen was a visiting student at SUNY Albany. Yiming Ying is supported by the National Science Foundation (NSF, Grant IIS1816227)
Abstract Full Text(HTML) Figure(2) / Table(3) Related Papers Cited by
  • Recently, there is considerable work on developing efficient stochastic optimization algorithms for AUC maximization. However, most of them focus on the least square loss which may be not the best option in practice. The main difficulty for dealing with the general convex loss is the pairwise nonlinearity w.r.t. the sampling distribution generating the data. In this paper, we use Bernstein polynomials to uniformly approximate the general losses which are able to decouple the pairwise nonlinearity. In particular, we show that this reduction for AUC maximization with a general loss is equivalent to a weakly convex (nonconvex) min-max formulation. Then, we develop a novel SGD algorithm for AUC maximization with per-iteration cost linearly w.r.t. the data dimension, making it amenable for streaming data analysis. Despite its non-convexity, we prove its global convergence by exploring the appealing convexity-preserving property of Bernstein polynomials and the intrinsic structure of the min-max formulation. Experiments are performed to validate the effectiveness of the proposed approach.

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.

    Citation:

    \begin{equation} \\ \end{equation}
  • 加载中
  • Figure 1.  Comparison of convergence speed between SAUC-H and $ \text{OAM}_{gra} $

    Figure 2.  Evaluation of AUC scores vesus the degree of the Bernstein polynomial

    Algorithm 1: Stochastic AUC Optimization (SAUC)

    1: Input: $ R>0 $, $ \gamma\geq\gamma_0 $ and $ \beta>0 $.
    2: Initialize $ \bar{{\mathbf{v}}}_0 = 0 $ and $ \bar{{\mathit{\boldsymbol{\alpha}}}}_0 = 0 $.
    3: for $ t=1 $ to $ T-1 $ do
    4: Set $ {\mathbf{v}}_0^t = \bar{{\mathbf{v}}}_{t-1}, {\mathit{\boldsymbol{\alpha}}}_0^t = \bar{{\mathit{\boldsymbol{\alpha}}}}_{t-1} $ and $ \eta_t = \frac{\beta}{\sqrt{t}}. $
    5: for $ j=1 $ to $ t $ do
    6: Randomly sample $ z_j^t = (x_j^t,y_j^t) $ and compute
    $ \begin{align*} &{\mathbf{v}}_{j}^t = {{\bf Proj}}_{{\Omega}_1} \bigl({\mathbf{v}}_{j-1}^t - \eta_t \nabla_{{\mathbf{v}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr), &{\mathit{\boldsymbol{\alpha}}}_{j}^t = {{\bf Proj}}_{{\Omega}_2} \bigl({\mathit{\boldsymbol{\alpha}}}_{j-1}^t + \eta_t \nabla_{{\mathit{\boldsymbol{\alpha}}}} \varPhi_{\gamma}^t({\mathbf{v}}_{j-1}^t,{\mathit{\boldsymbol{\alpha}}}_{j-1}^t;z_j^t)\bigr) \end{align*} $
    7: end for
    8: Compute $ \bar{{\mathbf{v}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathbf{v}}_j^t $ and $ \bar{{\mathit{\boldsymbol{\alpha}}}}_{t} = \frac{1}{t}\sum_{j=0}^{t-1} {\mathit{\boldsymbol{\alpha}}}_j^t. $
    9: end for
    10: Output: $ \widetilde{{\mathbf{v}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathbf{v}}}_{t} $ and $ \widetilde{{\mathit{\boldsymbol{\alpha}}}}_T:=\frac{1}{T}\sum_{t=0}^{T-1}\bar{{\mathit{\boldsymbol{\alpha}}}}_{t}. $
     | Show Table
    DownLoad: CSV

    Table 1.  Statistics of datasets

     | Show Table
    DownLoad: CSV

    Table 2.  Comparison of AUC score (mean$ \pm $std) on test data; OPAUC on news20 and sector does not converge in a reasonable time limit. Best AUC value on each dataset is in bold and second is underlined

     | Show Table
    DownLoad: CSV
  • [1] F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O (1/n), in Advances in Neural Information Processing Systems, (2013), 773–781.
    [2] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit., 30 (1997), 1145-1159.  doi: 10.1016/S0031-3203(96)00142-2.
    [3] T. Calders and S. Jaroszewicz, Efficient AUC optimization for classification, in PKDD, Vol. 4702, Springer, (2007), 42–53.
    [4] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 21 pp. doi: 10.1145/1961189.1961199.
    [5] S. ClémençonG. Lugosi and N. Vayatis, Ranking and empirical minimization of U-statistics, Ann. Statist., 36 (2008), 844-874.  doi: 10.1214/009052607000000910.
    [6] C. Cortes and M. Mohri, AUCoptimization vs. error rate minimization, in Advances in Neural Information Processing Systems, (2004), 313–320.
    [7] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), 207-239.  doi: 10.1137/18M1178244.
    [8] D. Davis and B. Grimmer, Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems, SIAM J. Optim., 29 (2019), 1908-1930.  doi: 10.1137/17M1151031.
    [9] Dheeru, Dua and Karra Taniskidou, Efi, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. Available from: http://archive.ics.uci.edu/ml.
    [10] D. Drusvyatskiy, The proximal point method revisited, preprint, arXiv: 1712.06038.
    [11] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett., 27 (2006), 861-874. 
    [12] W. Gao, R. Jin, S. Zhu and Z. H. Zhou, One-pass AUC optimization, in International Conference on Machine Learning, (2013), 906–914.
    [13] W. Gao and Z. H. Zhou, On the Consistency of AUC Pairwise Optimization, in IJCAI, (2015), 939–945.
    [14] J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143 (1982), 29-36. 
    [15] A. Herschtal and B. Raskutti, Optimising area under the ROC curve using gradient descent, in Proceedings of the 21st International Conference on Machine Learning, ACM, (2004), 49.
    [16] T. Joachims, A support vector method for multivariate performance measures, in Proceedings of the 22nd International Conference on Machine Learning, ACM, (2005), 377–384.
    [17] P. Kar, B. Sriperumbudur, P. Jain and H. Karnick, On the generalization ability of online learning algorithms for pairwise loss functions, in International Conference on Machine Learning, (2013), 441–449.
    [18] S. Lacoste-Julien, M. Schmidt and F. Bach, A simpler approach to obtaining an O (1/t) convergence rate for the projected stochastic subgradient method, preprint, arXiv: 1212.2002. doi: 10.1137/1.9781611974331.ch127.
    [19] J. Lin and L. Rosasco, Optimal learning for multi-pass stochastic gradient methods, in Advances in Neural Information Processing Systems, (2016), 4556–4564.
    [20] M. Liu, Z. Yuan, Y. Ying and T. Yang, Stochastic AUC Maximization with Deep Neural Networks, in International Conference on Learning Representations (ICLR), 2020.
    [21] M. Liu, X. Zhang, Z. Chen, X. Wang and T. Yang, Fast stochastic AUC maximization with O (1/n)-convergence rate, in International Conference on Machine Learning, (2018), 3195–3203.
    [22] M. Natole, Y. Ying and S. Lyu, Stochastic proximal algorithms for AUC maximization, in International Conference on Machine Learning, (2018), 3707–3716.
    [23] A. NemirovskiA. JuditskyG. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), 1574-1609.  doi: 10.1137/070704277.
    [24] E. A. Nurminskii, The quasigradient method for the solving of the nonlinear programming problems, Cybernetics, 9 (1973), 145-150. 
    [25] G. M. Phillips, Interpolation and Approximation by Polynomials, Vol. 14, Springer Science & Business Media, 2003. doi: 10.1007/b97417.
    [26] M. J. D. PowellApproximation Theory and Methods, Cambridge University Press, 1981. 
    [27] H. Rafique, M. Liu, Q. Lin and T. Yang, Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning, preprint, arXiv: 1810.02060.
    [28] A. Rakhlin, O. Shamir and K. Sridharan, Making gradient descent optimal for strongly convex stochastic optimization, in Proceedings of the 29th International Conference on Machine Learning, (2012), 449–456.
    [29] O. Shamir and T. Zhang, Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, in International Conference on Machine Learning, (2013), 71–79.
    [30] N. Srebro, A. Tewari, Stochastic optimization for machine learning, CML Tutorial, (2010).
    [31] Y. Wang, R. Khardon, D. Pechyony and R. Jones, Generalization bounds for online learning algorithms with pairwise loss functions, in Conference on Learning Theory, (2012), 13.
    [32] Y. Ying, L. Wen and S. Lyu, Stochastic online AUC maximization, in Advances in Neural Information Processing Systems, 2016.
    [33] Y. Ying and D. X. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775-4788.  doi: 10.1109/TIT.2006.883632.
    [34] Y. Ying and D. X. Zhou, Online pairwise learning algorithms, Neural Comput., 28 (2016), 743-777.  doi: 10.1162/neco_a_00817.
    [35] X. ZhangA. Saha and S. V. N. Vishwanathan, Smoothing multivariate performance measures, J. Mach. Learn. Res., 13 (2012), 3623-3680. 
    [36] P. Zhao, R. Jin, T. Yang and S. C. Hoi, Online AUC maximization, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
  • 加载中

Figures(2)

Tables(3)

SHARE

Article Metrics

HTML views(1502) PDF downloads(316) Cited by(0)

Access History

Other Articles By Authors

Catalog

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return