\`x^2+y_1+z_12^34\`
Advanced Search
Article Contents
Article Contents

Variable metric proximal stochastic variance reduced gradient methods for nonconvex nonsmooth optimization

  • * Corresponding author: Xin-Wei Liu

    * Corresponding author: Xin-Wei Liu 
Abstract Full Text(HTML) Figure(4) / Table(4) Related Papers Cited by
  • We study the problem of minimizing the sum of two functions. The first function is the average of a large number of nonconvex component functions and the second function is a convex (possibly nonsmooth) function that admits a simple proximal mapping. With a diagonal Barzilai-Borwein stepsize for updating the metric, we propose a variable metric proximal stochastic variance reduced gradient method in the mini-batch setting, named VM-SVRG. It is proved that VM-SVRG converges sublinearly to a stationary point in expectation. We further suggest a variant of VM-SVRG to achieve linear convergence rate in expectation for nonconvex problems satisfying the proximal Polyak-Łojasiewicz inequality. The complexity of VM-SVRG is lower than that of the proximal gradient method and proximal stochastic gradient method, and is the same as the proximal stochastic variance reduced gradient method. Numerical experiments are conducted on standard data sets. Comparisons with other advanced proximal stochastic gradient methods show the efficiency of the proposed method.

    Mathematics Subject Classification: Primary: 90C06; Secondary: 90C30.

    Citation:

    \begin{equation} \\ \end{equation}
  • 加载中
  • Figure 1.  Comparison of VM-SVRG with different mini-batch sizes

    Figure 2.  Comparison of VM-SVRG and other modern methods for solving LR problem

    Figure 3.  Comparison of VM-SVRG with different mini-batch sizes

    Figure 4.  Comparison of VM-SVRG and mS2GD for solving SVM problem

    Algorithm 2 PL-VM-SVRG($ w^0,m,b,U_0 $)
    Input: Number of inner iterations $ m $, initial point $ \tilde{w}_0=w^0 \in \mathbb{R}^d $, initial metric $ U_0 $, mini-batch size $ b\in \{1,2,\ldots,n\} $;
    1: for $ s=0,1,\ldots,S-1 $ do
    2:    $ w^{s+1} = $ VM-SVRG($ w^s,m,b,U_0 $).
    3: end for
    Output: $ w^S $.
     | Show Table
    DownLoad: CSV

    Table 1.  Comparison of the SFO and PO complexity.

    Complexity Prox-GD Prox-SGD Prox-SVRG VM-SVRG
    SFO $ \mathcal{O}(n/\epsilon) $ $ \mathcal{O}(1/\epsilon^2) $ $ \mathcal{O}(n+(n^{2/3}/\epsilon)) $ $ \mathcal{O}(n+(n^{2/3}/\epsilon)) $
    PO $ \mathcal{O}(1/\epsilon) $ $ \mathcal{O}(1/\epsilon) $ $ \mathcal{O}(1/\epsilon) $ $ \mathcal{O}(1/\epsilon) $
    SFO(PL) $ \mathcal{O}(n\kappa\log(1/\epsilon)) $ $ \mathcal{O}(1/\epsilon^2) $ $ \mathcal{O}((n+\kappa n^{2/3})\log(1/\epsilon)) $ $ \mathcal{O}((n+\kappa n^{2/3})\log(1/\epsilon)) $
    PO(PL) $ \mathcal{O}(\kappa \log(1/\epsilon)) $ $ \mathcal{O}(1/\epsilon) $ $ \mathcal{O}(\kappa \log(1/\epsilon)) $ $ \mathcal{O}(\kappa \log(1/\epsilon)) $
     | Show Table
    DownLoad: CSV

    Table 2.  The information of data sets

    Data sets $ n $ $ d $
    ijcnn1 49, 990 22
    rcv1 20, 242 47, 236
    real-sim 72, 309 20, 958
    covtype 581, 012 54
     | Show Table
    DownLoad: CSV

    Table 3.  Best choices of parameters for the methods

    Data sets mS2GD$ (m,\eta) $ mS2GD-BB mSARAH$ (m,\eta) $ mSARAH-BB VM-SVRG
    ijcnn1 $ (0.02n,\frac{4}{L}) $ $ 0.04n $ $ (0.05n,\frac{1.8}{L}) $ $ 0.04n $ $ 0.04n $
    rcv1 $ (0.1n,\frac{4}{L}) $ $ 0.11n $ $ (0.1n,\frac{3.5}{L}) $ $ 0.09n $ $ 0.25n $
    real-sim $ (0.12n,\frac{0.6}{L}) $ $ 0.15n $ $ (0.07n,\frac{2}{L}) $ $ 0.06 $ $ 0.11n $
    covtype $ (0.07n,\frac{21}{L}) $ $ 0.03n $ $ (0.07n,\frac{25}{L}) $ $ 0.008n $ $ 0.01n $
     | Show Table
    DownLoad: CSV
  • [1] J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), 141-148.  doi: 10.1093/imanum/8.1.141.
    [2] J. F. BonnansJ. Ch. GilbertC. Lemaréchal and C. A. Sagastizábal, A family of variable metric proximal methods, Math. Programming, 68 (1995), 15-47.  doi: 10.1007/BF01585756.
    [3] L. BottouF. E. Curtis and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), 223-311.  doi: 10.1137/16M1080173.
    [4] Y.-H. DaiM. Al-Baali and X. Yang, A positive Barzilai-Borwein-like stepsize and an extension for symmetric linear systems, Numerical Analysis and Optimization, 134 (2015), 59-75.  doi: 10.1007/978-3-319-17689-5_3.
    [5] Y.-H. DaiY. Huang and X.-W. Liu, A family of spectral gradient methods for optimization, Comput. Optim. Appl., 74 (2019), 43-65.  doi: 10.1007/s10589-019-00107-8.
    [6] A. Defazio, F. Bach and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, (2014), 1646-1654.
    [7] R. Fletcher, On the Barzilai-Borwein method, in Optimization and control with applications, Springer, New York, 96 (2005), 235-256. doi: 10.1007/0-387-24255-4_10.
    [8] S. Ghadimi and G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), 2341-2368.  doi: 10.1137/120880811.
    [9] S. GhadimiG. Lan and H. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program., 155 (2016), 267-305.  doi: 10.1007/s10107-014-0846-1.
    [10] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics. Springer, New York, 2009. doi: 10.1007/978-0-387-84858-7.
    [11] Y. HuangY.-H. DaiX.-W. Liu and H. Zhang, Gradient methods exploiting spectral properties, Optim. Methods Softw., 35 (2020), 681-705.  doi: 10.1080/10556788.2020.1727476.
    [12] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, (2013), 315-323.
    [13] H. Karimi, J. Nutini and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, (2016), 795-811. doi: 10.1007/978-3-319-46128-1_50.
    [14] J. KonečnỳJ. LiuP. Richtárik and M. Takáč, Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE Journal of Selected Topics in Signal Processing, 10 (2015), 242-255. 
    [15] J. Konečnỳ and P. Richtárik, Semi-stochastic gradient descent methods, Frontiers in Applied Mathematics and Statistics, 3 (2017), 9.
    [16] L. Lei, C. Ju, J. Chen and M. I. Jordan, Non-convex finite-sum optimization via SCSG methods, in Advances in Neural Information Processing Systems, (2017), 2348-2358.
    [17] Z. Li and J. Li, A simple proximal stochastic gradient method for nonsmooth nonconvex optimization, in Advances in Neural Information Processing Systems, (2018), 5564-5574.
    [18] Y. LiuX. Wang and T. Guo, A linearly convergent stochastic recursive gradient method for convex optimization, Optim. Lett., 14 (2020), 2265-2283.  doi: 10.1007/s11590-020-01550-x.
    [19] Y. Nesterov, Introductory Lectures on Convex Programming: A Basic Course, Applied Optimization, 87. Kluwer Academic Publishers, Boston, MA, 2004. doi: 10.1007/978-1-4419-8853-9.
    [20] L. M. NguyenJ. LiuK. Scheinberg and M. Takáč, SARAH: A novel method for machine learning problems using stochastic recursive gradient, Proceedings of the 34th International Conference on Machine Learning, 70 (2017), 2613-2621. 
    [21] L. A. ParenteP. A. Lotito and M. V. Solodov, A class of inexact variable metric proximal point algorithms, SIAM J. Optim., 19 (2008), 240-260.  doi: 10.1137/070688146.
    [22] N. Parikh and S. Boyd et al., Proximal algorithms, Foundations and Trends in Optimization, 1 (2014), 127-239. 
    [23] Y. Park, S. Dhar, S. Boyd and M. Shah, Variable metric proximal gradient method with diagonal Barzilai-Borwein stepsize, (2019), arXiv: 1910.07056.
    [24] N. H. Pham, L. M. Nguyen, D. T. Phan and Q. Tran-Dinh, ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization, J. Mach. Learn. Res., 21 (2020), Paper No. 110, 48 pp.
    [25] S. J. Reddi, A. Hefny, S. Sra, B. Poczos and A. Smola, Stochastic variance reduction for nonconvex optimization, in International Conference on Machine Learning, (2016), 314-323.
    [26] S. J. Reddi, S. Sra, B. Poczos and A. J. Smola, Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization, in Advances in Neural Information Processing Systems, (2016), 1145-1153.
    [27] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statistics, 22 (1951), 400-407.  doi: 10.1214/aoms/1177729586.
    [28] M. SchmidtN. Le Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., 162 (2017), 83-112.  doi: 10.1007/s10107-016-1030-6.
    [29] F. ShangK. ZhouH. LiuJ. ChengI. W. TsangL. ZhangD. Tao and L. Jiao, VR-SGD: A simple stochastic variance reduction method for machine learning, IEEE Transactions on Knowledge and Data Engineering, 32 (2020), 188-202.  doi: 10.1109/TKDE.2018.2878765.
    [30] C. Tan, S. Ma, Y. -H. Dai and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, in Advances in Neural Information Processing Systems, (2016), 685-693.
    [31] X. WangX. Wang and Y.-X. Yuan, Stochastic proximal quasi-Newton methods for non-convex composite optimization, Optim. Methods Softw., 34 (2019), 922-948.  doi: 10.1080/10556788.2018.1471141.
    [32] X. WangS. Wang and H. Zhang, Inexact proximal stochastic gradient method for convex composite optimization, Comput. Optim. Appl., 68 (2017), 579-618.  doi: 10.1007/s10589-017-9932-7.
    [33] X. Wang and H. Zhang, Inexact proximal stochastic second-order methods for nonconvex composite optimization, Optim. Methods Softw., 35 (2020), 808-835.  doi: 10.1080/10556788.2020.1713128.
    [34] L. Xiao and T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), 2057-2075.  doi: 10.1137/140961791.
    [35] T. Yu, X. -W. Liu, Y. -H. Dai and J. Sun, A minibatch proximal stochastic recursive gradient algorithm using a trust-region-like scheme and Barzilai-Borwein stepsizes, IEEE Transactions on Neural Networks and Learning Systems, 2020. doi: 10.1109/TNNLS. 2020.3025383.
    [36] T. Yu, X. -W. Liu, Y. -H. Dai and J. Sun, Stochastic variance reduced gradient methods using a trust-region-like scheme, J. Sci. Comput., 87 (2021), Article number: 5. doi: 10.1007/s10915-020-01402-x.
    [37] T. Yu, X. -W. Liu, Y. -H. Dai and J. Sun, A variable metric mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai-Borwein stepsize, 2020, arXiv: 2010.00817.
  • 加载中

Figures(4)

Tables(4)

SHARE

Article Metrics

HTML views(2068) PDF downloads(766) Cited by(0)

Access History

Other Articles By Authors

Catalog

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return