A REGULARIZATION INTERPRETATION OF THE PROXIMAL POINT METHOD FOR WEAKLY CONVEX FUNCTIONS

Abstract. Empirical evidence and theoretical results suggest that the proximal point method can be computed approximately and still converge faster than the corresponding gradient descent method, in both the stochastic and the exact gradient case. In this article we provide a perspective on this result by interpreting the method as gradient descent on a regularized function. This perspective applies in the case of weakly convex functions, where proofs of the faster rates are not available. Using this analysis we find the optimal value of the regularization parameter in terms of the weak convexity.


1.
Introduction. The importance of large scale optimization problems in machine learning has led to a resurgence of interest in first order optimization methods [8], and stochastic gradient descent in particular [9]. Proximal point methods [23] are an alternative to gradient descent methods, which first came into use in the setting where the proximal mapping can be computed exactly. Later, they were used in the stochastic setting, where the proximal mapping can only be computed approximately [18]. When the proximal point method parameter is tuned correctly, the proximal point method can converge faster than the corresponding stochastic gradient descent method [27, 26, 11]. However, the optimal choice of the parameter depends on convexity parameters of the objective, which may not be available. Insight into the method is provided by a regularization interpretation: in [12] the method was interpreted as gradient descent on a regularized objective function, which was the solution of a partial differential equation. The interpretation was also used to apply and tune the method in [31].
A heuristic explanation for the method motivated the implementation in [18]: the proximal point method with parameter λ corresponds to implicit gradient descent with time step λ, which has a corresponding convergence rate. The convergence rate is much slower for stochastic gradient descent. However, the proximal point method can be solved approximately, in a small number of iterations, even using stochastic gradients, since it is a strongly convex optimization problem. Thus the stochastic proximal point method with parameter λ can converge as fast as exact gradient descent with time step λ. The challenge, however, is to tune the parameter λ, which depends on the unavailable weak-convexity parameter of the objective function.
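This heuristic can be illustrated with a small numerical sketch (the problem, step sizes, and helper names below are ours, purely for illustration): on an ill-conditioned quadratic, a proximal point step with a large λ, solved only approximately by an inner gradient loop on the strongly convex subproblem, makes far more progress than explicit gradient steps whose size is limited by the largest curvature.

```python
import numpy as np

A = np.diag([1.0, 100.0])  # ill-conditioned quadratic f(x) = 0.5 x^T A x (toy problem)

def grad_f(x):
    # Gradient of the quadratic objective
    return A @ x

def prox_step(x, lam, inner_iters=50):
    """Approximate proximal point step: minimize f(u) + ||u - x||^2/(2 lam)
    by a few gradient iterations on the (strongly convex) subproblem."""
    u = x.copy()
    # Subproblem gradient is grad_f(u) + (u - x)/lam; strong convexity makes
    # a fixed step size safe for the inner loop.
    step = 1.0 / (np.max(np.diag(A)) + 1.0 / lam)
    for _ in range(inner_iters):
        u -= step * (grad_f(u) + (u - x) / lam)
    return u

x_gd = np.array([1.0, 1.0])
x_pp = np.array([1.0, 1.0])
lam_gd = 1.0 / 100.0          # gradient descent step limited by largest curvature
lam_pp = 1.0                  # proximal point tolerates a much larger time step
for _ in range(20):
    x_gd = x_gd - lam_gd * grad_f(x_gd)
    x_pp = prox_step(x_pp, lam_pp)
print(np.linalg.norm(x_pp) < np.linalg.norm(x_gd))  # True: inexact prox is far ahead
```

Even though the inner loop runs only 50 iterations, the approximate prox iterate is orders of magnitude closer to the minimizer than gradient descent after the same number of outer steps.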
While many problems are non-convex, the model problem for analysis is convex, which allows for a global analysis of convergence rates. Since many problems are ill-conditioned, accelerated methods [24, 20, 10] are used to improve convergence rates. Proximal stochastic methods can also be accelerated [22, 21, 18] (direct stochastic acceleration methods are also available [1]).
Polyak's method [24] provably accelerates strongly convex quadratic functions, but not general convex functions. However, it has the advantage of a simple interpretation as the explicit Forward Euler discretization of a second order ODE. On the other hand, Nesterov's method provably accelerates convex functions, but defies such a simple interpretation. The influential paper [30] provided an interpretation of Nesterov's method as the discretization of an ordinary differential equation (ODE), but with the gradients evaluated at a non-standard point in time. This interpretation was further studied in the quadratic case in [17] and [14] as well as [28], using linear stability analysis. However, while linear stability analysis provides insight locally, it does not apply globally to convex functions.
In this paper we study two aspects of proximal methods. The first is a method which interpolates between gradient descent and the proximal point method. This may give insight into Nesterov's method, which involves a gradient evaluated at an intermediate point. In connection with the regularization interpretation of [12], we study the convex analytical properties of the regularized function corresponding to the proximal point method, in the weakly convex case. We also study the optimal parameter for the proximal step, in terms of the weak convexity of the objective function. In this article we focus on exact gradients rather than stochastic gradients. In this simpler setting we can study the weakly convex case using tools from convex analysis and optimization.
Both the proximal point and the gradient descent method can be interpreted as a time discretization of the Ordinary Differential Equation (ODE)

ẋ(t) = −∇f(x(t)). (GD-ODE)

When the ODE is discretized using either the forward or the backward Euler method, the resulting algorithm corresponds to gradient descent or the proximal point method, respectively, as we explain below. Our starting point is a one-parameter family of discretizations of (GD-ODE) which appears in the numerical study of ODEs as the θ-method, cf. [29]. These are numerical discretizations of (GD-ODE) which interpolate between gradient descent (for θ = 0) and the proximal point method (for θ = 1). The proximal point methods require the solution of a strongly convex optimization problem at each step, but allow for much longer time steps. We can also consider the non-differentiable case, where ∇f(x) is replaced by a subdifferential (regular/limiting/Clarke), see e.g. Section 2 or [25, Chapter 8], and we obtain the differential inclusion

ẋ(t) ∈ −∂f(x(t)).

Notation. Throughout the paper, the notation used is standard and widely consistent with the one used in [25]. However, here we use ‖·‖ to denote the Euclidean norm.
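As a minimal sketch of the θ-method (the code and parameters are ours, for illustration only), the implicit step can be solved by fixed-point iteration, which contracts when λθ times the Lipschitz constant of ∇f is below one; θ = 0 recovers explicit gradient descent and θ = 1 the implicit (proximal point) step.

```python
import numpy as np

def theta_step(x, grad_f, lam, theta, iters=100):
    """One step of the theta-method x+ = x - lam * grad_f((1-theta) x + theta x+),
    solved by fixed-point iteration (sketch; assumes the iteration contracts,
    e.g. lam * theta * L < 1 for an L-Lipschitz gradient)."""
    x_new = x.copy()
    for _ in range(iters):
        x_new = x - lam * grad_f((1 - theta) * x + theta * x_new)
    return x_new

grad = lambda x: x            # f(x) = ||x||^2 / 2, so grad_f(x) = x
x = np.array([1.0])
# theta = 0 is explicit gradient descent, theta = 1 the proximal point method.
explicit = theta_step(x, grad, lam=0.5, theta=0.0)   # 1 - 0.5*1 = 0.5
implicit = theta_step(x, grad, lam=0.5, theta=1.0)   # solves y = 1 - 0.5 y, y = 2/3
print(explicit[0], implicit[0])
```

Note that the implicit step (2/3) is less aggressive than the explicit one (1/2) for the same λ, reflecting the unconditional stability of backward Euler.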

For f : Rⁿ → R ∪ {+∞} we define the domain by dom f := {x ∈ Rⁿ : f(x) < +∞}. In order to indicate that a function F maps vectors in Rⁿ to subsets of Rᵐ we write F : Rⁿ ⇒ Rᵐ and call F set-valued. The domain of F is defined by dom F := {x ∈ Rⁿ : F(x) ≠ ∅}.

2. Preliminaries. We first recall standard concepts from nonsmooth analysis; see [25]. A function f : Rⁿ → R ∪ {+∞} is called closed if it is lower semicontinuous, and proper if dom f ≠ ∅. If f : Rⁿ → R ∪ {+∞} is closed, proper, convex, it is well known that

∂f(x) = { v ∈ Rⁿ : f(u) ≥ f(x) + ⟨v, u − x⟩ for all u ∈ Rⁿ },

cf. e.g. [25, Proposition 8.12]. For f : Rⁿ → R its (Fenchel) conjugate is the function f* : Rⁿ → R defined by

f*(y) := sup_{x∈Rⁿ} { ⟨x, y⟩ − f(x) }.

If f is proper and has an affine minorant, its conjugate f* is always closed, proper, convex, see e.g. [25, Theorem 11.1], and notice that f is proper and has an affine minorant if and only if its convex hull is proper. For f : Rⁿ → R ∪ {+∞} closed, proper, convex, the subdifferential and the conjugate function interact in the following way:

y ∈ ∂f(x) ⟺ x ∈ ∂f*(y) ⟺ f(x) + f*(y) = ⟨x, y⟩, (3)

see e.g. [25, Proposition 11.3]. Given f : Rⁿ → R ∪ {+∞} and λ > 0, the proximal mapping or prox-operator is the set-valued map P_λf : Rⁿ ⇒ Rⁿ defined by

P_λf(x) := argmin_{u∈Rⁿ} { f(u) + (1/(2λ))‖x − u‖² },

while the Moreau envelope e_λf : Rⁿ → R is given by

e_λf(x) := inf_{u∈Rⁿ} { f(u) + (1/(2λ))‖x − u‖² }.

Definition 2.1. The θ-method for (GD-ODE) corresponds to the time discretization

x_{k+1} = x_k − λ∇f((1 − θ)x_k + θx_{k+1}), (4)

where λ is the time step. When θ = 0 or θ = 1, the θ-method is called the explicit or implicit Euler method, respectively.
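The prox-operator and Moreau envelope can be evaluated numerically by brute force; the helper below (our naming, for illustration only) does so on a grid for f = |·|, where the prox is known to be soft-thresholding and the envelope the Huber function.

```python
import numpy as np

def moreau_envelope(f, x, lam, grid):
    """Numerically evaluate e_lam f(x) = min_u f(u) + (x-u)^2/(2 lam) and the
    prox P_lam f(x) = argmin of the same objective, over a 1-d grid (sketch)."""
    vals = f(grid) + (x - grid) ** 2 / (2 * lam)
    i = np.argmin(vals)
    return vals[i], grid[i]

grid = np.linspace(-3.0, 3.0, 60001)   # spacing 1e-4
f = np.abs                              # f(u) = |u|, closed proper convex
env, prox = moreau_envelope(f, 2.0, 1.0, grid)
# For f = |.| the prox is soft-thresholding: P_lam f(x) = x - lam for x > lam,
# and the envelope value there is |x| - lam/2 (the Huber function).
print(prox, env)   # approximately 1.0 and 1.5
```

The grid computation matches the closed forms to within the grid spacing, which is a convenient sanity check when experimenting with nonconvex f as well.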
Note that we can generalize (4) to the nonsmooth case via the inclusion

x_k − x_{k+1} ∈ λ∂f((1 − θ)x_k + θx_{k+1}). (6)

The θ-method from (4) and (6), respectively, can be recovered from a proximal point-type iteration.
While the θ-method is implicit for θ > 0 (meaning it requires the solution of a nonlinear equation or a nonlinear optimization problem to find x_{k+1}), we can rewrite it as the gradient descent method on a modified function. In fact, defining the θ-Moreau envelope (see Section 3.2 for more details) as

e^θ_λf(x) := min_{u∈Rⁿ} { f(u) + (1/(2λθ))‖x − u‖² },

we show below that, for weakly convex f (see Section 2.2), the sequence (7) is also equivalent to

x_{k+1} = x_k − λ∇e^θ_λf(x_k).

Remark 1 (PDE interpretation). Our analysis of the θ-Moreau envelope is based on direct arguments. An alternative approach uses the Hamilton-Jacobi PDE.
It can be shown that the θ-Moreau envelope satisfies u^θ(x, λ) = v(x, λ), where v(x, t) is the weak (viscosity) solution of the Hamilton-Jacobi equation

∂v/∂t + (θ/2)‖∇ₓv‖² = 0, v(x, 0) = f(x).

In the special case θ = 1, we recover the standard Hamilton-Jacobi equation for the Moreau envelope,

∂v/∂t + (1/2)‖∇ₓv‖² = 0, v(x, 0) = f(x),

see [13].
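The rewriting of the θ-method as gradient descent on a modified envelope can be sanity-checked in closed form for the quadratic f(x) = x²/2, where the Moreau envelope with parameter s is x²/(2(1+s)). The check below (an illustration of ours, assuming the envelope parameter is λθ) compares one θ-method step with one gradient step of size λ on that envelope.

```python
# Sketch: one theta-method step equals one explicit gradient step (size lam) on
# the Moreau envelope with parameter s = lam * theta, for f(x) = x^2 / 2.
lam, theta, x = 0.7, 0.4, 1.3

# theta-method step: solve x+ = x - lam * f'((1-theta) x + theta x+), f'(y) = y,
# which gives x+ = x (1 - lam + lam*theta) / (1 + lam*theta) in closed form.
x_plus = x * (1 - lam + lam * theta) / (1 + lam * theta)

# Envelope e_s f(x) = x^2 / (2 (1+s)), so its gradient is x / (1+s).
s = lam * theta
x_env = x - lam * x / (1 + s)

print(abs(x_plus - x_env) < 1e-12)  # True: the two updates agree
```

Both updates reduce to x(1 + s − λ)/(1 + s), confirming the equivalence in this simple case.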

2.2.
Weakly convex functions. The proximal point algorithm is based on the fixed-point iteration

x_{k+1} ∈ P_λf(x_k)

for some λ > 0. This is in essence only tractable if the subproblem defining the prox-operator is convex, and ideally has a unique solution. A natural class of functions for which this holds is the following; see also Proposition 3.

Definition 2.3 (Weakly and strongly convex functions). A function f : Rⁿ → R ∪ {+∞} is called c-weakly convex for c ∈ R if f + (c/2)‖·‖² is convex. We denote by Γ_c the set of closed, proper, c-weakly convex functions.
A c-weakly convex function with c < 0 is called c-strongly convex.
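As a concrete illustration of Definition 2.3 (our example, not from the text): f(x) = sin(x) is 1-weakly convex, since f″(x) = −sin(x) ≥ −1, so sin(x) + x²/2 is convex; the constant cannot be lowered to 1/2.

```python
import numpy as np

# f(x) = sin(x): second derivative of f + (c/2) x^2 is -sin(x) + c.
xs = np.linspace(-10.0, 10.0, 2001)

g_second = -np.sin(xs) + 1.0    # c = 1: second derivative is nonnegative
print(np.all(g_second >= 0))    # True: sin(x) + x^2/2 is convex

h_second = -np.sin(xs) + 0.5    # c = 1/2 fails wherever sin(x) > 1/2
print(np.all(h_second >= 0))    # False
```

This also shows why the weak convexity parameter matters for tuning: it is the smallest quadratic shift that convexifies the objective.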
Clearly, Γ_0 is the cone of closed, proper, convex functions. Moreover, we have Γ_c ⊂ Γ_{c′} whenever c ≤ c′. Weakly convex functions can be further generalized to the class of lower-C² functions [25, Definition 10.29, Theorem 10.33], and many of the ideas and results in the sequel hold for these kinds of functions too, but for simplicity we confine ourselves to weakly convex ones.
The class of weakly convex functions also contains the Lasry-Lions regularization [16]. This regularization is a C^{1,1} function and, in [3], the authors show that any lower semicontinuous function defined on a Hilbert space that is quadratically minorized can be approximated by its Lasry-Lions regularization.
We also point the reader to [15, Proposition 1] for another class of functions which are weakly convex restricted to some open set.
The central property of weakly convex functions is that if we add a "large enough" strongly convex term, the sum becomes strongly convex, hence both coercive, i.e.

f(x) → +∞ as ‖x‖ → ∞

(in particular, level-bounded), and also strictly convex. We state this formally below.
Lemma 2.4. Let c > 0, f ∈ Γ_c and 0 < λc < 1. Then the function φ_λ := f + (1/(2λ))‖·‖² is (1/λ − c)-strongly convex, hence coercive and strictly convex.
Proof. Strong convexity is clear (by the c-weak convexity of f ) and implies both coercivity (using an affine minorization argument, see [25,Proposition 8.12]) and obviously strict convexity.
The next result is clear from an elementary sum rule.
Proof. See e.g. [25,Exercise 10.10] for the representation of the subdifferential. The remainder follows from that and the fact that the respective statements hold for convex functions.
We also point out that weakly convex functions are Clarke regular (see [25, Definition 7.25]), hence their regular and limiting subdifferentials coincide. In particular, for a (finite-valued) weakly convex function, the (regular) subdifferential is equal to Clarke's subdifferential, i.e. it can be computed as

∂f(x) = conv { lim_{k→∞} ∇f(x_k) : x_k → x, x_k ∈ D_f }, (11)

where D_f is the (full measure) set of points of differentiability of f and conv is the convex hull operator. We use (11) in Example 1.
The following lemma gives a generalized CFL condition on the time step under which (7) can be solved.
Lemma 2.5. Suppose f is c-weakly convex. Then x_{k+1}, the solution of (7), can be found as the solution of a convex optimization problem, provided λ, θ > 0 satisfy the following generalized CFL condition (time step restriction):

λθc ≤ 1.

Proof. To prove the CFL condition, notice that, since f is c-weakly convex, for all x ∈ Rⁿ the function u ↦ f(u) + (1/(2λθ))‖x − u‖² is convex whenever 1/(λθ) ≥ c, which is the stated condition.

We recall the central duality result for DC optimization.
Proposition 2 (Toland-Singer duality). Let g, h ∈ Γ_0. Then the following hold:
a) inf_{x∈Rⁿ} { g(x) − h(x) } = inf_{y∈Rⁿ} { h*(y) − g*(y) };
b) if x̄ ∈ argmin (g − h) and ȳ ∈ ∂h(x̄), then ȳ ∈ argmin (h* − g*).
We point out that items a) and b) in Proposition 2 remain valid even if the convexity of g is dropped.
3. The prox-operator and Moreau envelope for weakly convex functions.
3.1. The Moreau envelopes. In this section we study the Moreau envelope and proximal mapping for weakly convex functions. Many of the properties follow from more general results in variational analysis and monotone operator theory, see [7, 25]. We will point out where this is the case. However, we present a largely self-contained account built only on convex analysis (except when the nonconvex subdifferential is involved) and improve some of the existing results along the way.
Throughout this section we use the following standing assumption.

Assumption 1. Let c > 0, f ∈ Γ_c, 0 < λc < 1, and set

φ_λ := f + (1/(2λ))‖·‖². (10)

Proposition 3 (Prox-operator of weakly convex functions). Let f and φ_λ be as in Assumption 1. Then the following hold:
a) P_λf is single-valued on Rⁿ;
b) P_λf(x) = ∇φ*_λ(x/λ) for all x ∈ Rⁿ;
c) the critical points of f are exactly the fixed points of the prox-operator P_λf.
Proof. a) By definition we have, for all x ∈ Rⁿ,

P_λf(x) = argmin_{u∈Rⁿ} { f(u) + (1/(2λ))‖x − u‖² } = argmin_{u∈Rⁿ} { φ_λ(u) − (1/λ)⟨x, u⟩ }.

The function u ↦ φ_λ(u) − (1/λ)⟨x, u⟩ is strongly convex for every x ∈ Rⁿ, see Lemma 2.4. Hence, the argmin set above is always a singleton.
b) We have

p = P_λf(x) ⟺ 0 ∈ ∂( φ_λ − (1/λ)⟨x, ·⟩ )(p) ⟺ x/λ ∈ ∂φ_λ(p),

where the second equivalence uses the convexity of φ_λ and the consideration above in a). This proves the first equivalence in b); the identification p = ∇φ*_λ(x/λ) then follows from (3).
c) For x̄ ∈ Rⁿ we have

0 ∈ ∂f(x̄) ⟺ x̄/λ ∈ ∂φ_λ(x̄).

Here the first equivalence is due to Proposition 1. Part b) now gives the claim.
Note that part b) is given in a similar form in [25, Proposition 12.19].
The following result constitutes a slight generalization of [7, Proposition 12.26] and its self-contained proof follows the same pattern.
Lemma 3.1. Let c > 0 and f ∈ Γ_c, x ∈ Rⁿ, 0 < λc < 1, and put p := P_λf(x). Then, for all y ∈ Rⁿ,

⟨x − p, y − p⟩ ≤ λ( f(y) − f(p) ) + (λc/2)‖y − p‖².

Proof. Let y ∈ Rⁿ and φ_λ be defined by (10). Using the (1/λ − c)-strong convexity of φ_λ and noticing that x/λ ∈ ∂φ_λ(p) by Proposition 3 b), we obtain

φ_λ(y) ≥ φ_λ(p) + (1/λ)⟨x, y − p⟩ + ((1/λ − c)/2)‖y − p‖².

Therefore, we have

λf(y) ≥ λf(p) + ⟨x, y − p⟩ + (1/2)‖p‖² − (1/2)‖y‖² + ((1 − λc)/2)‖y − p‖²,

which is equivalent to the claimed inequality, since ⟨x, y − p⟩ + (1/2)‖p‖² − (1/2)‖y‖² = ⟨x − p, y − p⟩ − (1/2)‖y − p‖².

From Lemma 3.1 we infer the following property of the prox-operator, known in the literature as cocoercivity. It can also be derived as a consequence of the Baillon-Haddad theorem [5, 6, 7], as φ_λ is (1/λ − c)-strongly convex and ∇φ*_λ = P_λf(λ(·)), see Proposition 3. Our proof is, however, self-contained, as it builds only on Lemma 3.1, which itself is self-contained.
Proposition 4 (Cocoercivity of the prox-operator). Let c > 0 and f ∈ Γ_c. Then for 0 < λc < 1 we have, for all x, y ∈ Rⁿ,

⟨P_λf(x) − P_λf(y), x − y⟩ ≥ (1 − λc)‖P_λf(x) − P_λf(y)‖².

Proof. Let x, y ∈ Rⁿ and put p := P_λf(x) and q := P_λf(y). By Lemma 3.1 we have

⟨x − p, q − p⟩ ≤ λ( f(q) − f(p) ) + (λc/2)‖q − p‖² and ⟨y − q, p − q⟩ ≤ λ( f(p) − f(q) ) + (λc/2)‖p − q‖².

Adding the above inequalities yields

⟨(x − y) − (p − q), q − p⟩ ≤ λc‖p − q‖².

Rearranging gives the desired inequality.
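The cocoercivity inequality can be verified numerically for the 1-weakly convex f(u) = sin(u) (an illustration of ours): the optimality condition for p = P_λf(x) reads x = p + λcos(p), so we can sample p and recover x exactly, avoiding any numerical argmin.

```python
import numpy as np

# Check <p - q, x - y> >= (1 - lam*c) ||p - q||^2 for f(u) = sin(u), c = 1.
lam, c = 0.5, 1.0                 # 0 < lam * c < 1
rng = np.random.default_rng(0)
ps = rng.uniform(-3.0, 3.0, size=200)
xs = ps + lam * np.cos(ps)        # x = p + lam * f'(p) inverts the prox exactly
ok = True
for i in range(len(ps)):
    for j in range(i):
        p, q, x, y = ps[i], ps[j], xs[i], xs[j]
        ok = ok and (p - q) * (x - y) >= (1 - lam * c) * (p - q) ** 2 - 1e-12
print(ok)  # True: cocoercivity holds on all sampled pairs
```

Since the map p ↦ p + λcos(p) is strictly increasing for λ < 1, each sampled p is indeed the unique prox of the corresponding x.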
As an immediate consequence of Proposition 4 we recover the well-known result, see [25, Proposition 12.19], that P_λf is 1/(1 − λc)-Lipschitz continuous for any f ∈ Γ_c and 0 < λc < 1.
We now turn our attention to the Moreau envelope. We point out that the Lipschitz constant for the gradient of the Moreau envelope is, to the best of our knowledge, sharper than what can be found in the literature.
The Lipschitz modulus can be seen as follows: recalling that ∇e_λf(x) = (x − P_λf(x))/λ, we have, by Proposition 4,

‖∇e_λf(x) − ∇e_λf(y)‖ = (1/λ)‖(x − p) − (y − q)‖,

where p = P_λf(x) and q = P_λf(y). Now observe that

‖(x − p) − (y − q)‖² = ‖x − y‖² − 2⟨p − q, x − y⟩ + ‖p − q‖².

First, considering the case 1/2 ≥ cλ: as ⟨p − q, x − y⟩ ≥ (1 − λc)‖p − q‖² (cf. Proposition 4), we thus have

‖(x − p) − (y − q)‖² ≤ ‖x − y‖² − (1 − 2λc)‖p − q‖² ≤ ‖x − y‖².

On the other hand, for 1/2 ≤ cλ, we can continue the sequence of inequalities from above using Proposition 4 and Cauchy-Schwarz (which give ‖p − q‖ ≤ ‖x − y‖/(1 − λc)) to find

‖(x − p) − (y − q)‖² ≤ ‖x − y‖² + (2λc − 1)‖p − q‖² ≤ ( 1 + (2λc − 1)/(1 − λc)² )‖x − y‖² = ( λc/(1 − λc) )²‖x − y‖².

All in all, putting

L := max{ 1/λ, c/(1 − λc) },

we obtain ‖∇e_λf(x) − ∇e_λf(y)‖ ≤ L‖x − y‖, which proves the desired Lipschitz constant.

e) Follows from b) and Proposition 3 c).
The fact in Corollary 1 c) and d) that the optimal value and the minimizers, respectively, of f and its Moreau envelope coincide is well known, and valid under even weaker assumptions, see [25, Example 1.46]. However, our technique of proof via DC duality theory is novel and remains in the convex realm, which merits the presentation of said proof.
Remark 2 (Optimal parameter choice for λ). Corollary 1 b) provides us with an "optimal choice" for the parameter λ: since 1/λ is decreasing and c/(1 − λc) is increasing in λ, the constant L = max{1/λ, c/(1 − λc)} is minimized where the two terms coincide. Then λ = 1/(2c) yields L = 2c, which is as small as the Lipschitz constant can be for a given f.
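A quick numerical check of this remark (purely illustrative): minimizing L(λ) = max{1/λ, c/(1 − λc)} over a grid of admissible λ recovers λ = 1/(2c) and L = 2c.

```python
import numpy as np

# Minimize L(lam) = max(1/lam, c / (1 - lam*c)) over 0 < lam*c < 1.
c = 3.0
lams = np.linspace(1e-4, (1 - 1e-4) / c, 100001)
L = np.maximum(1.0 / lams, c / (1.0 - lams * c))
i = np.argmin(L)
print(lams[i], L[i])   # approximately 1/(2c) = 0.1667 and 2c = 6.0
```

The minimizer sits exactly at the crossing of the decreasing branch 1/λ and the increasing branch c/(1 − λc), as argued in the remark.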
In view of the Lipschitz constant for ∇e_λf derived in Corollary 1 b), the question arises naturally as to whether this constant can be improved in general within the class Γ_c. The following example illustrates Corollary 1 and also provides a negative answer to this question: it presents a Γ_c-function for which the Lipschitz constant provided by Corollary 1 is sharp in either case.
Then, clearly, f is a-weakly convex, and using (11) we can compute its subdifferential, prox-operator and Moreau envelope explicitly.

3.2. The θ-envelopes. We now generalize the notion of the proximal point mapping and the Moreau envelope by embedding them in a parameterized family of proximal mappings and envelopes, respectively.

Definition 3.2 (θ-Moreau envelopes). Let f : Rⁿ → R ∪ {+∞} and θ, λ > 0. The θ-proximal point operator is the map P^θ_λf : Rⁿ → Rⁿ given by

P^θ_λf(x) := argmin_{y∈Rⁿ} { f((1 − θ)x + θy) + (θ/(2λ))‖x − y‖² }.

The θ-Moreau envelope is the function e^θ_λf : Rⁿ → R defined by

e^θ_λf(x) := min_{y∈Rⁿ} { f((1 − θ)x + θy) + (θ/(2λ))‖x − y‖² }.

The following result shows the intimate relation of the θ-envelope and the θ-method objects to the Moreau envelope and the prox-operator.
Proof. a) For fixed x, the map y ↦ (1 − θ)x + θy is bijective on Rⁿ. Therefore, substituting u = (1 − θ)x + θy, so that ‖x − y‖² = ‖x − u‖²/θ², we observe that

e^θ_λf(x) = min_{u∈Rⁿ} { f(u) + (1/(2λθ))‖x − u‖² } = e_{λθ}f(x).

This proves a). In order to prove b), just revisit the above reasoning and observe that y = P^θ_λf(x) if and only if (1 − θ)x + θy ∈ argmin_{u∈Rⁿ} { f(u) + (1/(2λθ))‖x − u‖² }.
We readily infer the following result.
Corollary 2 (θ-envelope). For c > 0 let f ∈ Γ_c and θ, λ > 0 such that 0 < cθλ < 1. Then the following hold:
a) P^θ_λf is single-valued on Rⁿ;
b) e^θ_λf is differentiable, and ∇e^θ_λf is L-Lipschitz with L = max{ 1/(λθ), c/(1 − λθc) }.

4. Proximal-point as gradient descent. In this section, we study the behavior of a function f ∈ Γ_c under the θ-method discretization of the gradient descent ODE,

x_{k+1} = x_k − λ∇f((1 − θ)x_k + θx_{k+1}). (14)

Note that in the following, θ can be chosen bigger than 1. In the implicit case (θ = 1), the decrease of f along the iteration is straightforward, and in Proposition 5 we extend this result to (14). We do not discuss rates of convergence; for these we refer to [4, 19]. By Lemma 2.2, the method described in (14) is equivalent to the θ-proximal point method x_{k+1} = P^θ_λf(x_k). To simplify, we assume that f is differentiable, but the results can be generalized to the nonsmooth case. We say that ∇f is one-sided L_f-Lipschitz if

⟨∇f(x) − ∇f(y), x − y⟩ ≤ L_f‖x − y‖² for all x, y ∈ Rⁿ, (15)

which is equivalent to convexity of (L_f/2)‖·‖² − f. It is easy to check that if ∇f is an L_f-Lipschitz function, then ∇f is one-sided L_f-Lipschitz. In addition, if f is convex, then the two conditions are equivalent.
Proposition 5. Let f ∈ Γ_c be a differentiable function such that ∇f is one-sided L_f-Lipschitz, and let {x_k} be generated by (14) for θ > 0. Then

f(x_{k+1}) − f(x_k) ≤ ( (L_f(1 − θ)² + cθ²)/2 − 1/λ )‖x_{k+1} − x_k‖².

In particular, if λ ∈ (0, 2/(L_f(1 − θ)² + cθ²)), the sequence {f(x_k)} decreases.

Proof. Denote x_θ = (1 − θ)x_k + θx_{k+1}. By weak convexity and (15), we obtain

f(x_k) ≥ f(x_θ) + ⟨∇f(x_θ), x_k − x_θ⟩ − (c/2)‖x_k − x_θ‖² and f(x_{k+1}) ≤ f(x_θ) + ⟨∇f(x_θ), x_{k+1} − x_θ⟩ + (L_f/2)‖x_{k+1} − x_θ‖².

By definition of x_θ, we have x_{k+1} − x_θ = (1 − θ)(x_{k+1} − x_k) and x_k − x_θ = −θ(x_{k+1} − x_k), which then yields the desired inequality using (14).
Remark 3. In the convex case c = 0 with θ = 1, we recover the fact that the descent of f is guaranteed for all λ > 0.
In addition, given f ∈ Γ_c, we have already seen that the sequence {x_k} generated by (14) can be interpreted as a sequence obtained from applying gradient descent to the θ-envelope e^θ_λf. By Corollary 2, we know that ∇e^θ_λf is L-Lipschitz, which implies that e^θ_λf satisfies (15), and thus the following result follows readily.

Proposition 6. Let f ∈ Γ_c, θ > 0, λ > 0 such that 0 < cθλ < 1, and let {x_k} be generated by (14). Then

e^θ_λf(x_{k+1}) ≤ e^θ_λf(x_k) − λ(1 − λL/2)‖∇e^θ_λf(x_k)‖²,

where L > 0 is the Lipschitz constant in Corollary 2. In particular, if λ ≤ 1/L, the sequence {e^θ_λf(x_k)} decreases.

Proof. Follows immediately from (15) applied to e^θ_λf, together with x_{k+1} = x_k − λ∇e^θ_λf(x_k).
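The guaranteed decrease can be observed in a small experiment (our parameters, purely illustrative): running the θ-method on the 1-weakly convex f(x) = sin(x), with the implicit step solved by fixed-point iteration, the objective values decrease monotonically and the iterates approach a critical point.

```python
import numpy as np

f, df = np.sin, np.cos            # f is 1-weakly convex, df is one-sided 1-Lipschitz

def theta_step(x, lam, theta, iters=200):
    """Solve the implicit step (14) by fixed-point iteration (contracts here
    since lam * theta * |f''| <= 0.28 < 1)."""
    x_new = x
    for _ in range(iters):
        x_new = x - lam * df((1 - theta) * x + theta * x_new)
    return x_new

lam, theta, x = 0.4, 0.7, 1.0
vals = [f(x)]
for _ in range(40):
    x = theta_step(x, lam, theta)
    vals.append(f(x))
print(all(vals[k + 1] <= vals[k] + 1e-12 for k in range(40)))  # True: monotone decay
print(abs(df(x)))   # near 0: x approaches the critical point -pi/2
```

Here λ = 0.4 is well inside the descent regime, and the iterates settle at the minimizer of sin, where the gradient vanishes.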
From Proposition 5, we deduce that every accumulation point of {x_k} is a stationary point of f.

Proposition 7. Let c > 0 and f ∈ Γ_c. In addition, assume that f is C¹, coercive, bounded from below and satisfies (15), and let {x_k} be generated by (14). Then, for all λ as in Proposition 5, ‖x_{k+1} − x_k‖ → 0, and therefore ∇f(x_∞) = 0 for every accumulation point x_∞ of {x_k}, by (14).

Remark 4.
The C 1 condition can be relaxed to a lower semicontinuity assumption using the limiting subdifferential, see [4].
We illustrate the above result by two examples. In the first one we revisit Example 1.
The second example concerns the classical Rosenbrock function.
Example 3 (Rosenbrock function). Consider the Rosenbrock function f : R² → R,

f(x, y) = (1 − x)² + 100(y − x²)².

In Figure 2, we plot the iterations of gradient descent and of the proximal point method with the optimal parameter choice λ = 1/(2c). In addition, we observe the decay of the Rosenbrock function along the iterations.
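A sketch of such an experiment (our parameters; the inner loop and the value of λ are illustrative, not the tuned λ = 1/(2c) used for the figure): approximate proximal point steps on the Rosenbrock function, with the prox subproblem solved by an inner gradient loop. Since the inner loop only decreases the subproblem objective starting from the current iterate, the Rosenbrock value is guaranteed to decrease monotonically.

```python
import numpy as np

def f(z):
    x, y = z
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def grad(z):
    x, y = z
    return np.array([-2 * (1 - x) - 400 * x * (y - x * x),
                     200 * (y - x * x)])

def prox_step(z, lam, inner=5000, step=3e-4):
    """Approximate prox: inner gradient descent on f(u) + ||u - z||^2/(2 lam).
    A small fixed step keeps the inner loop stable on this stiff problem."""
    u = z.copy()
    for _ in range(inner):
        u -= step * (grad(u) + (u - z) / lam)
    return u

z = np.array([0.0, 0.0])
vals = [f(z)]
for _ in range(20):
    z = prox_step(z, lam=0.05)
    vals.append(f(z))
print(vals[-1] < vals[0])   # True: monotone decrease of the Rosenbrock value
```

Because f(u) ≤ f(u) + ‖u − z‖²/(2λ) ≤ f(z) whenever the inner loop decreases the subproblem objective, descent of f holds even with a crude inner solver.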

5.
Perspectives on the proximal point method for weakly convex functions. In this section we present different interpretations of the proximal point method, namely as a DC algorithm and as a proximal-gradient method, each of which provides different insights.

Proximal point method as DC algorithm.
A very popular and powerful algorithm for solving DC optimization problems of the form

min_{x∈Rⁿ} g(x) − h(x) (18)

with g, h ∈ Γ_0 is the so-called DC Algorithm, DCA for short, which goes back to An and Tao, see e.g. [2]. In its simplified version (which coincides with the original version in our setting) it reads as follows:
1. Choose x_0 ∈ dom ∂h;
2. Compute y_k ∈ ∂h(x_k);
3. Compute x_{k+1} ∈ ∂g*(y_k).
We point out that DCA applied to (18) is well-defined if (and only if)

dom ∂g ⊂ dom ∂h and dom ∂h* ⊂ dom ∂g*, (19)

cf. [2, Lemma 1]. Now assume that f ∈ Γ_c. As was argued earlier, a natural DC decomposition of f is

f = φ_λ − (1/(2λ))‖·‖²,

where, as always, φ_λ = f + (1/(2λ))‖·‖². Condition (19) is clearly satisfied. Hence, for any x_0 ∈ Rⁿ, the DCA is well-defined and generates the sequences y_k = (1/λ)x_k and x_{k+1} = ∇φ*_λ(y_k) = P_λf(x_k), cf. Proposition 3 b).
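The coincidence of the DCA step with the prox step can be checked numerically for f(u) = sin(u) and λ = 1/2 (an illustration of ours): the DCA update x_{k+1} = ∇φ*_λ(x_k/λ) solves φ′_λ(p) = x_k/λ, i.e. cos(p) + p/λ = x_k/λ, which is exactly the optimality condition of P_λf(x_k).

```python
import numpy as np

lam, x = 0.5, 1.7

# DCA step: solve cos(p) + p/lam = x/lam by bisection; the left-hand side is
# strictly increasing because 1/lam = 2 exceeds the weak convexity constant 1.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if np.cos(mid) + mid / lam < x / lam:
        lo = mid
    else:
        hi = mid
x_dca = 0.5 * (lo + hi)

# Prox step: direct minimization of sin(u) + (x-u)^2/(2 lam) on a fine grid.
grid = np.linspace(-6.0, 6.0, 600001)
p = grid[np.argmin(np.sin(grid) + (x - grid) ** 2 / (2 * lam))]
print(abs(x_dca - p) < 1e-4)   # True: the DCA step equals the prox step
```

Both computations target the same strongly convex subproblem, which is why DCA on this decomposition reproduces the proximal point iteration exactly.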

Proof. We have
{P_λφ_λ(2x)} = argmin_y { f(y) + (1/(2λ))‖y‖² + (1/(2λ))‖2x − y‖² }
= argmin_y { f(y) + (1/λ)‖y‖² − (2/λ)⟨x, y⟩ + (1/(2λ))‖2x‖² }
= argmin_y { f(y) + (1/λ)‖y‖² − (2/λ)⟨x, y⟩ + (1/λ)‖x‖² }
= argmin_y { f(y) + (1/λ)‖x − y‖² }
= {P_{λ/2}f(x)}.

6. Final remarks. We studied proximal point-type methods for weakly convex functions, with the following main results. We investigated the proximal mapping and the Moreau envelope for weakly convex (not necessarily smooth) functions, establishing an optimal choice of the regularization parameter. In the smooth case we revealed a connection between the θ-proximal point method and the θ-method for gradient flows. Moreover, under an additional one-sided Lipschitz property, we proved a guaranteed decrease of the regularized objective function for the θ-proximal point method. Finally, we gave two different interpretations of the proximal point method for (possibly nonsmooth) weakly convex functions, which provide new insights into the algorithm.