Journal of Dynamics & Games, January 2020, 7(1): 79-96. doi: 10.3934/jdg.2020005

A regularization interpretation of the proximal point method for weakly convex functions

Tim Hoheisel*, Maxime Laborde and Adam Oberman
Department of Mathematics and Statistics, McGill University, Montreal, Canada

* Corresponding author: tim.hoheisel@mcgill.ca

Received: January 2019. Published: February 2020.

Fund Project: The third author is supported by AFOSR FA9550-18-1-0167.

Empirical evidence and theoretical results suggest that the proximal point method can be computed approximately and still converge faster than the corresponding gradient descent method, in both the stochastic and the exact gradient case. In this article we provide a perspective on this result by interpreting the method as gradient descent on a regularized function. This perspective also applies to weakly convex functions, where proofs of the faster rates are not available. Using this analysis, we find the optimal value of the regularization parameter in terms of the weak convexity.
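The identity behind this interpretation is standard and can be sketched as follows; the notation $f_\lambda$ and $\rho$ below is ours and is not fixed by the abstract. The proximal point step with parameter $\lambda > 0$,
\[
x_{k+1} \;=\; \operatorname{prox}_{\lambda f}(x_k) \;:=\; \operatorname*{arg\,min}_{y}\Big\{\, f(y) + \tfrac{1}{2\lambda}\,\|y - x_k\|^2 \Big\},
\]
is exactly a gradient descent step, with step size $\lambda$, on the Moreau envelope (the regularized function)
\[
f_\lambda(x) \;:=\; \min_{y}\Big\{\, f(y) + \tfrac{1}{2\lambda}\,\|y - x\|^2 \Big\},
\qquad
x_{k+1} \;=\; x_k - \lambda \nabla f_\lambda(x_k).
\]
When $f$ is $\rho$-weakly convex, i.e. $f + \tfrac{\rho}{2}\|\cdot\|^2$ is convex, the proximal subproblem is strongly convex and $f_\lambda$ is well defined and differentiable whenever $0 < \lambda < 1/\rho$, which is the regime in which a regularization parameter depending on the weak convexity can be chosen.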

Citation: Tim Hoheisel, Maxime Laborde, Adam Oberman. A regularization interpretation of the proximal point method for weakly convex functions. Journal of Dynamics & Games, 2020, 7(1): 79-96. doi: 10.3934/jdg.2020005
References:
[1] Z. Allen-Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, J. Mach. Learn. Res., 18 (2017), 51 pp.
[2] L. T. H. An and P. D. Tao, Convex analysis approach to d.c. programming: Theory, algorithms and applications, Acta Math. Vietnam., 22 (1997), 289-355.
[3] H. Attouch and D. Azé, Approximation and regularization of arbitrary functions in Hilbert spaces by the Lasry-Lions method, Ann. Inst. H. Poincaré Anal. Non Linéaire, 10 (1993), 289-319. doi: 10.1016/S0294-1449(16)30214-1.
[4] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116 (2009), 5-16. doi: 10.1007/s10107-007-0133-5.
[5] J.-B. Baillon and G. Haddad, Quelques propriétés des opérateurs angle-bornés et $n$-cycliquement monotones, Israel J. Math., 26 (1977), 137-150. doi: 10.1007/BF03007664.
[6] H. H. Bauschke and P. L. Combettes, The Baillon-Haddad theorem revisited, J. Convex Anal., 17 (2010), 781-787.
[7] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, New York, 2011. doi: 10.1007/978-1-4419-9467-7.
[8] A. Beck, First-Order Methods in Optimization, MOS-SIAM Series on Optimization, 25, Society for Industrial and Applied Mathematics (SIAM) and Mathematical Optimization Society, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997.ch1.
[9] L. Bottou, F. E. Curtis and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), 223-311. doi: 10.1137/16M1080173.
[10] S. Bubeck, Convex optimization: Algorithms and complexity, Found. Trends Mach. Learn., 8 (2015), 231-357. doi: 10.1561/2200000050.
[11] Y. Carmon, J. C. Duchi, O. Hinder and A. Sidford, Accelerated methods for nonconvex optimization, SIAM J. Optim., 28 (2018), 1751-1772. doi: 10.1137/17M1114296.
[12] P. Chaudhari, A. Oberman, S. Osher, S. Soatto and G. Carlier, Deep relaxation: Partial differential equations for optimizing deep neural networks, Res. Math. Sci., 5 (2018), 30 pp. doi: 10.1007/s40687-018-0148-y.
[13] L. C. Evans, Partial Differential Equations, Second edition, Graduate Studies in Mathematics, 19, American Mathematical Society, Providence, RI, 2010.
[14] N. Flammarion and F. Bach, From averaging to acceleration, there is only a step-size, Conference on Learning Theory, (2015), 658-695.
[15] A. Kaplan and R. Tichatschke, Proximal point methods and nonconvex optimization, J. Global Optim., 13 (1998), 389-406. doi: 10.1023/A:1008321423879.
[16] J.-M. Lasry and P.-L. Lions, A remark on regularization in Hilbert spaces, Israel J. Math., 55 (1986), 257-266. doi: 10.1007/BF02765025.
[17] L. Lessard, B. Recht and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., 26 (2016), 57-95. doi: 10.1137/15M1009597.
[18] H. Lin, J. Mairal and Z. Harchaoui, A universal catalyst for first-order optimization, Advances in Neural Information Processing Systems 28, Curran Associates, Inc., (2015), 3384-3392.
[19] B. Merlet and M. Pierre, Convergence to equilibrium for the backward Euler scheme and applications, Commun. Pure Appl. Anal., 9 (2010), 685-702. doi: 10.3934/cpaa.2010.9.685.
[20] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 87, Kluwer Academic Publishers, Boston, MA, 2004. doi: 10.1007/978-1-4419-8853-9.
[21] A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., (2014), 1574-1582.
[22] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal and Z. Harchaoui, Catalyst for gradient-based nonconvex optimization, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 84 (2018), 613-622.
[23] N. Parikh and S. Boyd, Proximal algorithms, Found. Trends Optim., 1 (2014), 127-239. doi: 10.1561/2400000003.
[24] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, Ž. Vyčisl. Mat. i Fiz., 4 (1964), 791-803.
[25] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren der Mathematischen Wissenschaften, 317, Springer-Verlag, Berlin, 1998. doi: 10.1007/978-3-642-02431-3.
[26] L. Rosasco, S. Villa and B. C. Vũ, Convergence of stochastic proximal gradient algorithm, preprint, arXiv:1403.5074.
[27] M. Schmidt, N. Le Roux and F. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, Advances in Neural Information Processing Systems 24, Curran Associates, Inc., (2011), 1458-1466.
[28] D. Scieur, V. Roulet, F. Bach and A. d'Aspremont, Integration methods and optimization algorithms, Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates, Inc., (2017), 1109-1118.
[29] A. M. Stuart and A. R. Humphries, Dynamical Systems and Numerical Analysis, Cambridge Monographs on Applied and Computational Mathematics, 2, Cambridge University Press, Cambridge, 1996.
[30] W. J. Su, S. Boyd and E. J. Candès, A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights, J. Mach. Learn. Res., 17 (2016), 43 pp.
[31] P. H. Yin, M. Pham, A. Oberman and S. Osher, Stochastic backward Euler: An implicit gradient descent algorithm for $k$-means clustering, J. Sci. Comput., 77 (2018), 1133-1146. doi: 10.1007/s10915-018-0744-4.

Figure 1.  Illustration of Example 1
Figure 2.  Top: iterates of gradient descent and of the proximal point method; with the same total number of gradient evaluations, the proximal point method gets closer to the global minimum. Bottom: function values at each iteration of gradient descent and at each outer proximal point iteration for the Rosenbrock function (for a fair comparison, total gradient evaluations are counted, with 10 or 20 inner gradient evaluations per approximate proximal point step). After 2000 gradient evaluations the function value for gradient descent is still of order 1, while for the proximal point method it is of order $10^{-2}$; moreover, the proximal point function values appear to decrease at a first-order rate
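The comparison in Figure 2 can be reproduced in outline with the following minimal Python sketch; it is not the authors' code. Only the gradient-evaluation budget (2000) and the number of inner steps per outer iteration (10 or 20) come from the caption; the step sizes, the regularization parameter lam, and the starting point are assumptions made for illustration.

import numpy as np

def rosenbrock_grad(x):
    # Gradient of the 2-D Rosenbrock function f(x) = (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2.
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
        200.0 * (x[1] - x[0] ** 2),
    ])

def gradient_descent(x0, step, n_evals):
    # Plain gradient descent; the budget is counted in gradient evaluations.
    x = x0.copy()
    for _ in range(n_evals):
        x = x - step * rosenbrock_grad(x)
    return x

def approx_proximal_point(x0, lam, n_evals, inner_steps, inner_step):
    # Approximate proximal point method: each outer step approximately solves
    #   min_y  f(y) + ||y - x||^2 / (2*lam)
    # with a fixed number of inner gradient steps, so the total number of
    # gradient evaluations matches the gradient descent budget.
    x = x0.copy()
    for _ in range(n_evals // inner_steps):
        y = x.copy()
        for _ in range(inner_steps):
            y = y - inner_step * (rosenbrock_grad(y) + (y - x) / lam)
        x = y
    return x

if __name__ == "__main__":
    x0 = np.array([-1.2, 1.0])  # assumed starting point (not stated in the caption)
    print(gradient_descent(x0, step=1e-3, n_evals=2000))
    print(approx_proximal_point(x0, lam=0.1, n_evals=2000, inner_steps=10, inner_step=1e-3))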