Empirical evidence and theoretical results suggest that the proximal point method can be computed approximately and still converge faster than the corresponding gradient descent method, in both the stochastic and exact gradient cases. In this article we provide a perspective on this result by interpreting the method as gradient descent on a regularized function. This perspective applies to weakly convex functions, where proofs of the faster rates are not available. Using this analysis, we find the optimal value of the regularization parameter in terms of the weak convexity.
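As a brief sketch of the objects involved (in standard notation; the symbols $f$, $\lambda$, and $\mu$ below are our own labels, not necessarily those used in the article), the proximal point step and the regularized function (the Moreau envelope) are
$$
x_{k+1} = \operatorname{prox}_{\lambda f}(x_k) := \arg\min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\|y - x_k\|^2 \Big\},
\qquad
f_\lambda(x) := \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\|y - x\|^2 \Big\}.
$$
Under standard assumptions, when $f$ is $\mu$-weakly convex (that is, $f + \tfrac{\mu}{2}\|\cdot\|^2$ is convex) and $0 < \lambda < 1/\mu$, the envelope $f_\lambda$ is differentiable with $\nabla f_\lambda(x) = \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big)$, so each proximal point step is exactly a gradient descent step of size $\lambda$ on the regularized function: $x_{k+1} = x_k - \lambda\, \nabla f_\lambda(x_k)$.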
Figure 2 (illustration of Example 1). Top: iterations of gradient descent and the proximal point method. The proximal point method gets closer to the global minimum with the same total number of gradient evaluations. Bottom: function values at each iteration of gradient descent and at each outer proximal point iteration for the Rosenbrock function (for a fair comparison, total gradient evaluations are counted, with 10 or 20 gradient evaluations per outer step of the approximate proximal point method). After 2000 gradient evaluations the function value for gradient descent is still of order 1, while for the proximal point method it is of order $10^{-2}$; moreover, the function values appear to decrease at a first-order rate.
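For concreteness, here is a minimal sketch of the comparison described in the figure, assuming plain gradient descent as the inner solver for the proximal subproblem; the step size, regularization parameter, and starting point are illustrative guesses, not the values used in the article.

```python
import numpy as np

def f(x):
    """Rosenbrock function: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def grad_f(x):
    """Gradient of the Rosenbrock function."""
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
        200.0 * (x[1] - x[0]**2),
    ])

def gradient_descent(x0, h, n_evals):
    """Plain gradient descent, budgeted by gradient evaluations."""
    x = x0.copy()
    for _ in range(n_evals):
        x = x - h * grad_f(x)
    return x

def approx_prox_point(x0, lam, h, n_inner, n_evals):
    """Approximate proximal point method: each outer step approximately
    minimizes f(y) + ||y - x_k||^2 / (2 lam) using n_inner inner gradient
    steps, so the total gradient-evaluation budget matches gradient descent."""
    x = x0.copy()
    for _ in range(n_evals // n_inner):
        y = x.copy()
        for _ in range(n_inner):
            y = y - h * (grad_f(y) + (y - x) / lam)
        x = y
    return x

if __name__ == "__main__":
    x0 = np.array([-1.2, 1.0])   # standard Rosenbrock starting point (an assumption)
    # Step size and regularization below are illustrative guesses.
    x_gd = gradient_descent(x0, h=5e-4, n_evals=2000)
    x_pp = approx_prox_point(x0, lam=0.1, h=5e-4, n_inner=20, n_evals=2000)
    print("gradient descent:       f =", f(x_gd))
    print("approx. proximal point: f =", f(x_pp))
```

Matching the total number of gradient evaluations (here, `n_inner` inner steps per outer step) is what makes the comparison in the caption fair.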