In this paper, we study a class of composite optimization problems whose objective function is the sum of a large number of nonsmooth nonconvex loss functions and a cardinality regularizer. We first investigate the optimality conditions of these problems and then propose a stochastic proximal subgradient method (SPSG) to solve them. We establish the almost sure subsequence convergence of SPSG under mild assumptions, and we emphasize that these assumptions are satisfied by a wide range of problems arising from training neural networks. Furthermore, we conduct preliminary numerical experiments to demonstrate the effectiveness and efficiency of SPSG on this class of problems. A minimal sketch of such a method is given below.
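The sketch below is illustrative rather than the authors' algorithm: it assumes access to a stochastic subgradient oracle for the loss, and the names `prox_l0`, `spsg`, and `oracle` are hypothetical. The only paper-independent fact it relies on is the standard closed-form proximal mapping of the cardinality penalty, which is componentwise hard thresholding.

```python
import numpy as np

def prox_l0(x, lam, eta):
    # Proximal mapping of u -> eta * lam * ||u||_0, computed componentwise:
    # keep x_i if x_i**2 > 2 * eta * lam, otherwise set it to zero
    # (hard thresholding).
    out = x.copy()
    out[x ** 2 <= 2.0 * eta * lam] = 0.0
    return out

def spsg(subgrad_oracle, x0, lam, steps=1000, eta0=0.1):
    # One plausible stochastic proximal subgradient loop: a stochastic
    # subgradient step on the (nonsmooth, nonconvex) loss, followed by the
    # exact prox of the cardinality regularizer, with diminishing step sizes.
    x = x0.copy()
    for k in range(steps):
        eta = eta0 / np.sqrt(k + 1.0)        # diminishing step size
        g = subgrad_oracle(x)                # stochastic subgradient of the loss
        x = prox_l0(x - eta * g, lam, eta)   # proximal (hard-thresholding) step
    return x

# Toy usage: sparse recovery with the nonsmooth loss f(x) = mean_i |a_i^T x - b_i|.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
x_true = np.zeros(50)
x_true[:5] = rng.normal(size=5)
b = A @ x_true

def oracle(x, batch=16):
    idx = rng.integers(0, A.shape[0], size=batch)
    r = np.sign(A[idx] @ x - b[idx])         # subgradient of the absolute residuals
    return A[idx].T @ r / batch

x_hat = spsg(oracle, np.zeros(50), lam=1e-3)
print("nonzeros:", np.count_nonzero(x_hat))
```

The diminishing step size eta0 / sqrt(k + 1) is a common generic choice for stochastic subgradient schemes; the paper's own step-size conditions may differ.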
Figure: Test results of applying SPSG for training ResNet18 on the CIFAR-10 dataset with different choices of ...