Journal of Dynamics & Games, January 2020, 7(1): 79-96. doi: 10.3934/jdg.2020005

A regularization interpretation of the proximal point method for weakly convex functions

Tim Hoheisel*, Maxime Laborde and Adam Oberman

Department of Mathematics and Statistics, McGill University, Montreal, Canada

* Corresponding author: tim.hoheisel@mcgill.ca

Received: January 2019. Published: February 2020.

Fund Project: The third author is supported by AFOSR FA9550-18-1-0167.

Empirical evidence and theoretical results suggest that the proximal point method can be computed approximately and still converge faster than the corresponding gradient descent method, in both the stochastic and exact gradient settings. In this article we provide a perspective on this result by interpreting the method as gradient descent on a regularized function. This perspective applies to weakly convex functions, for which proofs of the faster rates are not available. Using this analysis, we find the optimal value of the regularization parameter in terms of the weak convexity.
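To make the mechanism described above concrete, the following is a minimal numerical sketch (not taken from the paper): each proximal point step is computed approximately by a few inner gradient descent iterations on the prox subproblem, so that one outer step behaves like a gradient step on the Moreau envelope, i.e. the regularized function. The function names, default step sizes, and iteration counts (`approximate_prox_step`, `inner_lr`, etc.) are illustrative assumptions; the only structural requirement is that the regularization parameter `lam` be smaller than the reciprocal of the weak convexity modulus, so that the subproblem is convex.

```python
import numpy as np

def approximate_prox_step(grad_f, x, lam, inner_iters=10, inner_lr=5e-4):
    """One approximate proximal point step:
        x_plus ~ argmin_y  f(y) + ||y - x||^2 / (2 * lam),
    computed with a few gradient steps on the subproblem. For a
    rho-weakly convex f the subproblem is strongly convex whenever
    lam < 1/rho, so the step is well defined."""
    y = np.array(x, dtype=float)
    for _ in range(inner_iters):
        g = grad_f(y) + (y - x) / lam   # gradient of the prox subproblem at y
        y = y - inner_lr * g
    return y

def proximal_point_method(grad_f, x0, lam, outer_iters=200,
                          inner_iters=10, inner_lr=5e-4):
    """Iterate the approximate prox step; each outer step acts like a
    gradient step of size lam on the Moreau envelope of f."""
    x = np.array(x0, dtype=float)
    for _ in range(outer_iters):
        x = approximate_prox_step(grad_f, x, lam, inner_iters, inner_lr)
    return x
```

In this reading the outer iteration is ordinary gradient descent applied to the Moreau envelope, which is smoother than the original function; the analysis in the paper concerns how the regularization parameter should be chosen relative to the weak convexity modulus for this interpretation to hold.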

Citation: Tim Hoheisel, Maxime Laborde, Adam Oberman. A regularization interpretation of the proximal point method for weakly convex functions. Journal of Dynamics & Games, 2020, 7(1): 79-96. doi: 10.3934/jdg.2020005

Figure 1.  Illustration of Example 1
Figure 2.  Top: Iterations of gradient descent and the proximal point method. The proximal point method gets closer to the global minimum with the same number of total gradient evaluations. Bottom: function values at each iteration of gradient descent and at each outer proximal point iteration for the Rosenbrock function (for a fair comparison, total gradient evaluations are counted, with 10 or 20 evaluations per approximate proximal point step). After 2000 gradient evaluations the function value for gradient descent is still of order 1, while that of the proximal point method is of order $ 10^{-2} $; moreover, the proximal point function values appear to decrease at a first-order rate
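The equal-budget comparison described in the caption of Figure 2 can be mimicked with a short script. The sketch below is not the authors' code: the starting point, step sizes, regularization parameter, and the choice of 10 inner iterations per outer step are assumptions chosen only to illustrate how total gradient evaluations are counted; the resulting numbers depend on this tuning.

```python
import numpy as np

def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
        200.0 * (x[1] - x[0]**2),
    ])

def gradient_descent(x0, lr, budget):
    # Plain gradient descent, spending `budget` gradient evaluations.
    x = np.array(x0, dtype=float)
    for _ in range(budget):
        x = x - lr * rosenbrock_grad(x)
    return x

def approx_proximal_point(x0, lam, inner_lr, inner_iters, budget):
    # Each outer step spends `inner_iters` gradient evaluations on the
    # prox subproblem, so the total budget matches the plain-gradient run.
    x = np.array(x0, dtype=float)
    for _ in range(budget // inner_iters):
        y = x.copy()
        for _ in range(inner_iters):
            y = y - inner_lr * (rosenbrock_grad(y) + (y - x) / lam)
        x = y
    return x

if __name__ == "__main__":
    x0, budget = [-1.2, 1.0], 2000
    x_gd = gradient_descent(x0, lr=5e-4, budget=budget)
    x_pp = approx_proximal_point(x0, lam=0.1, inner_lr=5e-4,
                                 inner_iters=10, budget=budget)
    print("gradient descent:       f =", rosenbrock(x_gd))
    print("approx. proximal point: f =", rosenbrock(x_pp))
```

Counting `budget // inner_iters` outer steps is what makes the comparison fair: both methods see exactly the same number of gradient evaluations, which is the accounting used in Figure 2.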