
Convergence properties of stochastic proximal subgradient method in solving a class of composite optimization problems with cardinality regularizer

*Corresponding author: Nachuan Xiao

The research of Xiaoyin Hu is supported by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ23A010002, the National Natural Science Foundation of China (Grant No. 12301408), the Scientific Research Project of the Zhejiang Provincial Education Department (Y202248716), the Scientific Research Foundation of Hangzhou City University (No. J-202317), and the advanced computing resources provided by the Supercomputing Center of HZCU. The research of Xin Liu is supported in part by the National Natural Science Foundation of China (12125108, 11971466, 12226008, 11991021, 11991020, 12021001, 12288201), the Key Research Program of Frontier Sciences, Chinese Academy of Sciences (ZDBSLY-7022), and the CAS AMSS-PolyU Joint Laboratory of Applied Mathematics. The research of Nachuan Xiao is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 3 grant call (MOE-2019-T3-1-010).

Abstract
In this paper, we study a class of composite optimization problems whose objective function is the sum of a collection of nonsmooth nonconvex loss functions and a cardinality regularizer. We first investigate the optimality conditions of these problems and then propose a stochastic proximal subgradient method (SPSG) to solve them. We then establish the almost sure subsequence convergence of SPSG under mild assumptions, and we emphasize that these assumptions are satisfied by a wide range of problems arising from training neural networks. Furthermore, we conduct preliminary numerical experiments to demonstrate the effectiveness and efficiency of SPSG in solving this class of problems.

Mathematics Subject Classification: Primary: 90C30, 58F17; Secondary: 65K05.

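The abstract does not spell out the SPSG update itself, so the following is a minimal sketch under two assumptions that are not stated above: each iteration takes a stochastic subgradient step on a sampled loss and then applies the proximal mapping of the cardinality term $\lambda\|x\|_0$, and that proximal mapping reduces to componentwise hard thresholding (an entry $v_i$ survives only when $|v_i| > \sqrt{2\eta\lambda}$ for step size $\eta$). The names `prox_card` and `spsg_step` are hypothetical and only serve this illustration.

```python
import numpy as np

def prox_card(v, lam, eta):
    """Assumed proximal mapping of eta * lam * ||.||_0 (hard thresholding):
    keep v_i when v_i**2 / 2 > eta * lam, i.e. when |v_i| > sqrt(2 * eta * lam)."""
    out = v.copy()
    out[np.abs(v) <= np.sqrt(2.0 * eta * lam)] = 0.0
    return out

def spsg_step(x, stoch_subgrad, lam, eta):
    """One (assumed) SPSG iteration: stochastic subgradient step on the
    sampled loss, followed by the proximal step for the cardinality term."""
    return prox_card(x - eta * stoch_subgrad(x), lam, eta)
```

A typical use would draw a random sample, evaluate a subgradient of the corresponding loss at the current iterate, and call `spsg_step` repeatedly with a step-size sequence $\eta_k$; the precise step-size and sampling conditions behind the convergence result are given in the paper, not in the abstract.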
Figure 1. Test results of applying (SPSG) to train ResNet18 on the CIFAR-10 dataset with different choices of $\delta_{reg}$
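As a purely illustrative reading of the Figure 1 setting, the snippet below sketches how such a scheme could be wired into ResNet18/CIFAR-10 training, assuming $\delta_{reg}$ is the weight of the cardinality regularizer. This is not the authors' experiment code; details such as which parameters are regularized, step-size schedules, and data augmentation are simplified or omitted.

```python
import torch
import torchvision
import torchvision.transforms as T

def hard_threshold_(params, lr, delta_reg):
    """In-place proximal step for lr * delta_reg * ||.||_0:
    zero every entry with |w| <= sqrt(2 * lr * delta_reg)."""
    tau = (2.0 * lr * delta_reg) ** 0.5
    for p in params:
        p.mul_((p.abs() > tau).to(p.dtype))

def train_spsg(delta_reg=1e-6, lr=0.1, epochs=1, batch_size=128, device="cpu"):
    data = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=T.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)
    model = torchvision.models.resnet18(num_classes=10).to(device)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            model.zero_grad()
            loss_fn(model(x), y).backward()
            with torch.no_grad():
                for p in model.parameters():
                    p.add_(p.grad, alpha=-lr)                       # stochastic (sub)gradient step
                hard_threshold_(model.parameters(), lr, delta_reg)  # proximal step for the cardinality term
    return model
```

Sweeping $\delta_{reg}$ as in Figure 1 would then amount to calling `train_spsg` with different values and comparing test accuracy against the resulting fraction of zero weights.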
