Rate of convergence of inertial gradient dynamics with time-dependent viscous damping coefficient

In a Hilbert space $H$, we study the convergence properties as $t \to +\infty$ of the trajectories of the second-order differential equation

(IGS)$_\gamma$   $\ddot{x}(t) + \gamma(t)\dot{x}(t) + \nabla\Phi(x(t)) = 0.$


Introduction
Throughout the paper, $H$ is a real Hilbert space endowed with the scalar product $\langle\cdot,\cdot\rangle$, with $\|x\|^2 = \langle x, x\rangle$ for $x \in H$. As a standing assumption, we suppose that $\Phi : H \to \mathbb{R}$ is a convex differentiable function whose gradient $\nabla\Phi$ is Lipschitz continuous on bounded sets, and whose solution set $S = \operatorname{argmin}\Phi$ is nonempty. For a fixed $t_0 \in \mathbb{R}$, we consider the second-order differential equation

(IGS)$_\gamma$   $\ddot{x}(t) + \gamma(t)\dot{x}(t) + \nabla\Phi(x(t)) = 0, \quad t \ge t_0,$

where $\gamma(\cdot)$ is a positive time-dependent damping parameter of class $C^2$. The appellation (IGS)$_\gamma$ stands for Inertial Gradient System with parameter $\gamma(\cdot)$. We give conditions, expressed directly on $\gamma(\cdot)$, that allow us to describe the asymptotic behavior of the trajectories as $t \to +\infty$. In particular, we evaluate the convergence rate of the values $\Phi(x(t)) - \min_H\Phi$ and of the global energy.
1.1. Historical presentation. This study is part of the active research that has been devoted over the last decade to accelerated gradient methods for convex optimization, and to their links with continuous gradient-like inertial dynamics. For recent papers on inertial gradient systems with time-dependent friction, the reader is referred to [1,2,4,5,8,9,11,12,13,14,17,19,20,21,26]. Our paper comes as a natural complement to the recent article by Attouch and Cabot [2]. One of our goals is to broaden the scope of this study to cover the important case $\gamma(t) = \alpha/t$ for all positive values of the parameter $\alpha$. Let us briefly explain the importance of this case, and the recent progress regarding its study.
The tuning of the damping parameter $\gamma(t)$ plays a central role in obtaining fast gradient methods in numerical optimization. Recent developments show the close relationship between fast gradient methods for convex optimization and inertial gradient systems with vanishing damping coefficient, that is, $\gamma(t) \to 0$ as $t \to +\infty$. As pointed out by Su-Boyd-Candès in [26], the (IGS)$_\gamma$ system with $\gamma(t) = 3/t$,

(1)   $\ddot{x}(t) + \frac{3}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0,$

can be seen as a continuous version of the accelerated gradient method of Nesterov (see [22,23]). This method has been developed to deal with large-scale structured convex minimization problems, as in the FISTA algorithm of Beck-Teboulle [10]. These methods guarantee (in the worst case) the convergence rate $\Phi(x_k) - \min_H\Phi = O(1/k^2)$, where $k$ is the number of iterations. Convergence of the trajectories generated by (1), and of the sequences generated by FISTA, has not been established so far (except in the one-dimensional case, see [5]). This is a puzzling question in the study of numerical optimization methods. By making a slight change in the coefficient of the damping parameter, one can overcome this difficulty. Recently, Attouch-Chbani-Peypouquet-Redont [4] and May [21] showed convergence of the trajectories of the (IGS)$_\gamma$ system with $\gamma(t) = \alpha/t$ and $\alpha > 3$:

(2)   $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$
They also obtained the improved convergence rate $\Phi(x(t)) - \min_H\Phi = o(1/t^2)$ as $t \to +\infty$. Corresponding results for the algorithmic case have been obtained by Chambolle-Dossal [15], and by Attouch-Peypouquet [6]. The subcritical case $\alpha < 3$ has been recently studied by Apidopoulos-Aujol-Dossal [1], Aujol-Dossal [8] and Attouch-Chbani-Riahi [5]. In this case, the convergence rate of the values is $\Phi(x(t)) - \min_H\Phi = O(t^{-2\alpha/3})$, and this is the best that can be guaranteed whatever the function $\Phi$, see [1]. In that paper, the optimality result has been proved by using a non-differentiable function, namely $\Phi = |\cdot|$ on $\mathbb{R}$. Optimality is then understood for functions that satisfy a suitable differential inclusion extending (2). Similar results are valid for the corresponding algorithms.
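1.2. Presentation of the results. Let us introduce the basic ingredients used in the description of the results. Recall that the global energy is defined by

$W(t) = \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min_H\Phi.$

As a direct consequence of the equation (IGS)$_\gamma$, differentiating $W(\cdot)$ we get that

$\frac{d}{dt}W(t) = -\gamma(t)\|\dot{x}(t)\|^2.$

Hence $\frac{d}{dt}W(t) \le 0$, and $W(\cdot)$ is a nonincreasing function. Assume now that $\Phi = 0$. Then equation (IGS)$_\gamma$ reduces to the linear differential equation

$\ddot{x}(t) + \gamma(t)\dot{x}(t) = 0.$

Let us multiply this equality by the integrating factor $p(t) = e^{\int_{t_0}^{t}\gamma(\tau)\,d\tau}$ and integrate on $[t_0, t]$. We obtain $p(t)\dot{x}(t) = \dot{x}(t_0)$ for every $t \ge t_0$. By integrating again, we find

$x(t) = x(t_0) + \dot{x}(t_0)\int_{t_0}^{t}\frac{ds}{p(s)}.$

It follows immediately that the trajectory $x(\cdot)$ converges if and only if $\dot{x}(t_0) = 0$ or $\int_{t_0}^{+\infty}\frac{ds}{p(s)} < +\infty$.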
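The convergence criterion for the case $\Phi = 0$ can be checked numerically. The following is a minimal sketch, assuming the particular damping $\gamma(t) = \alpha/t$ (so that $p(t) = (t/t_0)^\alpha$) and a real-valued trajectory; the function names are hypothetical and chosen only for this illustration.

```python
import math

def integral_inv_p(t, t0, alpha, n=2000):
    """Midpoint-rule value of int_{t0}^t ds / p(s), with p(s) = (s / t0)**alpha.

    The substitution s = t0 * exp(u) turns the integrand into
    t0 * exp((1 - alpha) * u) on [0, log(t / t0)].
    """
    upper = math.log(t / t0)
    h = upper / n
    return t0 * h * sum(math.exp((1.0 - alpha) * (k + 0.5) * h) for k in range(n))

def position(t, t0, x0, v0, alpha):
    """x(t) = x(t0) + x'(t0) * int_{t0}^t ds / p(s), the solution when Phi = 0."""
    return x0 + v0 * integral_inv_p(t, t0, alpha)

def limit_position(t0, x0, v0, alpha):
    """Limit of x(t) as t -> +infinity, or None when the trajectory diverges."""
    if v0 == 0.0:
        return x0                      # stationary trajectory
    if alpha <= 1.0:
        return None                    # int ds / p(s) = +infinity
    return x0 + v0 * t0 / (alpha - 1.0)

if __name__ == "__main__":
    print(position(1e6, 1.0, 0.0, 1.0, 2.0), limit_position(1.0, 0.0, 1.0, 2.0))
    print(position(1e6, 1.0, 0.0, 1.0, 0.5), limit_position(1.0, 0.0, 1.0, 0.5))
```

For this choice of damping, the integral of $1/p$ is finite exactly when $\alpha > 1$, which is why the second call reports no limit.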
Let us come back to the case of a general function $\Phi$ and assume that the above condition is satisfied. We then define the function $\Gamma : [t_0, +\infty[ \to \mathbb{R}_+$ by

$\Gamma(t) = p(t)\int_{t}^{+\infty}\frac{ds}{p(s)}.$

The theorem below gives the basic assumptions on the function $\gamma$ that ensure convergence of the trajectories of (IGS)$_\gamma$.
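Theorem 1.1. Let us assume that $\gamma : [t_0, +\infty[ \to \mathbb{R}_+$ is a continuous function satisfying:
(i) $\int_{t_0}^{+\infty}\frac{ds}{p(s)} < +\infty$;
(ii) there exist $t_1 \ge t_0$ and $m < \frac{3}{2}$ such that $\gamma(t)\Gamma(t) \le m$ for every $t \ge t_1$;
(iii) $\int_{t_0}^{+\infty}\Gamma(s)\,ds = +\infty$.
Then every solution trajectory $x : [t_0, +\infty[ \to H$ of (IGS)$_\gamma$ converges weakly toward some $x^* \in \operatorname{argmin}\Phi$. Moreover, the energy function $W$ satisfies the following rate of convergence:

$W(t) = o\left(\frac{1}{\int_{t_0}^{t}\Gamma(s)\,ds}\right)$ as $t \to +\infty$.

Taking $\gamma(t) = \alpha/t$ with $\alpha > 1$ gives, after some elementary computation,

$\Gamma(t) = \frac{t}{\alpha - 1}.$

Then, the condition $\gamma(t)\Gamma(t) \le m$ with $m < \frac{3}{2}$ is equivalent to $\alpha > 3$. As a consequence, for $\gamma(t) = \alpha/t$ and $\alpha > 3$, we recover the convergence of the trajectories of (IGS)$_\gamma$, together with the rate of convergence

$W(t) = o\left(\frac{1}{t^2}\right)$ as $t \to +\infty$.

The above theorem does not cover the subcritical case, corresponding to $\alpha \le 3$. This is the reason why we consider in this paper an alternative approach, based on a new Lyapunov function. We assume that $\gamma(\cdot)$ is of class $C^2$ and satisfies the following condition: there exists some positive real number $0 < r \le \frac{1}{3}$ such that, for $t$ large enough,

(H$_{\gamma,r}$)   $\ddot{\gamma}(t) \ge 2r^2\gamma(t)^3.$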
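As a worked illustration of the two conditions for $\gamma(t) = \alpha/t$, the computation can be sketched as follows. It assumes that $\Gamma(t) = p(t)\int_t^{+\infty} ds/p(s)$, and that the left-hand side of (H$_{\gamma,r}$) is the second derivative $\ddot{\gamma}$ (the reading suggested by the $C^2$ assumption on $\gamma$).

```latex
% Assumption: \gamma(t) = \alpha / t on [t_0, +\infty[, with t_0 > 0.
\begin{align*}
p(t) &= \exp\Big(\int_{t_0}^{t} \frac{\alpha}{\tau}\, d\tau\Big)
      = \Big(\frac{t}{t_0}\Big)^{\alpha}, \\
\Gamma(t) &= p(t) \int_{t}^{+\infty} \frac{ds}{p(s)}
      = t^{\alpha} \cdot \frac{t^{1-\alpha}}{\alpha - 1}
      = \frac{t}{\alpha - 1} \qquad (\alpha > 1), \\
\gamma(t)\,\Gamma(t) &= \frac{\alpha}{\alpha - 1} < \frac{3}{2}
      \iff \alpha > 3, \\
\ddot{\gamma}(t) = \frac{2\alpha}{t^{3}} &\ge 2 r^{2}\, \frac{\alpha^{3}}{t^{3}}
      = 2 r^{2} \gamma(t)^{3}
      \iff r \le \frac{1}{\alpha}.
\end{align*}
```

Combined with the restriction $r \le 1/3$, the largest admissible value is $r = \min(1/3, 1/\alpha)$, which is consistent with $r = 1/3$ being critical exactly when $\alpha = 3$.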
The value $r = \frac{1}{3}$ turns out to be critical (it corresponds to $\alpha = 3$ in the case $\gamma(t) = \alpha/t$). Under the assumption (H$_{\gamma,r}$) with $r \in\, ]0, 1/3]$, we show in Section 2 (see Theorem 2.1) that

$\Phi(x(t)) - \min_H\Phi = O\left(\frac{1}{p(t)^{2r}}\right)$ as $t \to +\infty$.

In Section 4 we apply the above results to some particular situations. In addition to the important case $\gamma(t) = \alpha/t$ ($\alpha > 0$), we also consider the case $\gamma(t) = \frac{1}{t(\ln t)^{\beta}}$, $0 \le \beta \le 1$. Finally, in Section 5, we study the perturbed version of (IGS)$_\gamma$ obtained by introducing a second member $g$ in the dynamics. We show in Theorem 5.1 that the convergence results remain valid if the perturbation $g$ is not too large asymptotically. This reflects some structural stability of the dynamics (IGS)$_\gamma$.

Rate of convergence of the energy
Let us state our main result concerning the rate of convergence of the energy for (IGS) γ .
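Theorem 2.1. Let $\Phi : H \to \mathbb{R}$ be a convex differentiable function whose solution set $\operatorname{argmin}\Phi$ is nonempty. Let $t_0 \in \mathbb{R}$ and let $\gamma(\cdot) : [t_0, +\infty[ \to \mathbb{R}_+$ be a nonincreasing function that is twice continuously differentiable. Suppose that there exists some positive real number $0 < r \le \frac{1}{3}$ such that, for $t$ large enough,

(H$_{\gamma,r}$)   $\ddot{\gamma}(t) \ge 2r^2\gamma(t)^3.$

Then, for every solution trajectory $x(\cdot)$ of (IGS)$_\gamma$, we have, as $t \to +\infty$,

(3)   $\Phi(x(t)) - \min_H\Phi = O\left(\frac{1}{p(t)^{2r}}\right)$, where $p(t) = e^{\int_{t_0}^{t}\gamma(\tau)\,d\tau}$.

If, moreover, the trajectory $x(\cdot)$ is bounded, then

(4)   $\|\dot{x}(t)\| = O\left(\frac{1}{p(t)^{r}}\right)$, and hence $W(t) = O\left(\frac{1}{p(t)^{2r}}\right)$.

Proof. Fix $z \in \operatorname{argmin}\Phi$. For $r > 0$ and differentiable functions $\lambda(\cdot)$, $\xi(\cdot)$ to be chosen, consider the energy function

(5)   $E(t) = p(t)^{2r}\big(\Phi(x(t)) - \min_H\Phi\big) + \frac{1}{2}\big\|\lambda(t)(x(t) - z) + p(t)^{r}\dot{x}(t)\big\|^2 + \frac{\xi(t)}{2}\|x(t) - z\|^2.$

Let us differentiate each term, using $\dot{p}(t) = \gamma(t)p(t)$. We have

$\frac{d}{dt}\Big[p(t)^{2r}\big(\Phi(x(t)) - \min_H\Phi\big)\Big] = 2r\gamma(t)p(t)^{2r}\big(\Phi(x(t)) - \min_H\Phi\big) + p(t)^{2r}\langle\nabla\Phi(x(t)), \dot{x}(t)\rangle,$

$\frac{d}{dt}\Big[\frac{\xi(t)}{2}\|x(t) - z\|^2\Big] = \frac{\dot{\xi}(t)}{2}\|x(t) - z\|^2 + \xi(t)\langle x(t) - z, \dot{x}(t)\rangle,$

and, writing $v(t) = \lambda(t)(x(t) - z) + p(t)^{r}\dot{x}(t)$,

$\frac{d}{dt}\frac{1}{2}\|v(t)\|^2 = \big\langle v(t),\ \dot{\lambda}(t)(x(t) - z) + \big(\lambda(t) + r\gamma(t)p(t)^{r}\big)\dot{x}(t) + p(t)^{r}\ddot{x}(t)\big\rangle$
$\qquad = \big\langle v(t),\ \dot{\lambda}(t)(x(t) - z) + \big(\lambda(t) + (r-1)\gamma(t)p(t)^{r}\big)\dot{x}(t) - p(t)^{r}\nabla\Phi(x(t))\big\rangle$
$\qquad \le \lambda(t)\dot{\lambda}(t)\|x(t) - z\|^2 + \big(\lambda(t)^2 + (r-1)\gamma(t)p(t)^{r}\lambda(t) + p(t)^{r}\dot{\lambda}(t)\big)\langle x(t) - z, \dot{x}(t)\rangle + p(t)^{r}\big(\lambda(t) + (r-1)\gamma(t)p(t)^{r}\big)\|\dot{x}(t)\|^2 - \lambda(t)p(t)^{r}\big(\Phi(x(t)) - \min_H\Phi\big) - p(t)^{2r}\langle\nabla\Phi(x(t)), \dot{x}(t)\rangle.$

To obtain the last two relations we have successively used (IGS)$_\gamma$ and the convexity of $\Phi$ (with $\lambda(t) \ge 0$). Adding the above results, we obtain, after simplification,

(7)   $\frac{d}{dt}E(t) \le \big(2r\gamma(t)p(t)^{2r} - \lambda(t)p(t)^{r}\big)\big(\Phi(x(t)) - \min_H\Phi\big) + \big(\xi(t) + \lambda(t)^2 + (r-1)\gamma(t)p(t)^{r}\lambda(t) + p(t)^{r}\dot{\lambda}(t)\big)\langle x(t) - z, \dot{x}(t)\rangle + p(t)^{r}\big(\lambda(t) + (r-1)\gamma(t)p(t)^{r}\big)\|\dot{x}(t)\|^2 + \Big(\lambda(t)\dot{\lambda}(t) + \frac{\dot{\xi}(t)}{2}\Big)\|x(t) - z\|^2.$

Let us successively examine the different terms entering the right-hand side of (7). Let us make the first two terms equal to zero by taking respectively

(8)   $\lambda(t) = 2r\gamma(t)p(t)^{r},$

(9)   $\xi(t) = -\lambda(t)^2 - (r-1)\gamma(t)p(t)^{r}\lambda(t) - p(t)^{r}\dot{\lambda}(t).$

Let us combine (8) with (9), and use that $\dot{\lambda}(t) = 2rp(t)^{r}\big(\dot{\gamma}(t) + r\gamma(t)^2\big)$. We obtain

(10)   $\xi(t) = -2rp(t)^{2r}\big(\dot{\gamma}(t) + (4r-1)\gamma(t)^2\big).$

Recall that we want $E(\cdot)$ to be a nonincreasing function. Therefore, we require the third term of the right-hand side of (7) to satisfy

$p(t)^{r}\big(\lambda(t) + (r-1)\gamma(t)p(t)^{r}\big) \le 0.$

Taking into account (8), that is $\lambda(t) = 2r\gamma(t)p(t)^{r}$, this reads $(3r-1)\gamma(t)p(t)^{2r} \le 0$, which is equivalent to assuming that

(H$_1$)   $r \le \frac{1}{3}.$

Let us finally consider the last term of (7), and study under which conditions on the parameters the inequality

(11)   $\lambda(t)\dot{\lambda}(t) + \frac{\dot{\xi}(t)}{2} \le 0$

is satisfied. By differentiating (10), we obtain

$\dot{\xi}(t) = -2rp(t)^{2r}\big(\ddot{\gamma}(t) + 2(5r-1)\gamma(t)\dot{\gamma}(t) + 2r(4r-1)\gamma(t)^3\big),$

and hence

$\lambda(t)\dot{\lambda}(t) + \frac{\dot{\xi}(t)}{2} = -rp(t)^{2r}\big(\ddot{\gamma}(t) + 2(3r-1)\gamma(t)\dot{\gamma}(t) + 2r(2r-1)\gamma(t)^3\big).$

As a consequence, inequality (11) is satisfied under the following additional assumption: for $t$ large enough,

(H$_2$)   $\ddot{\gamma}(t) + 2(3r-1)\gamma(t)\dot{\gamma}(t) + 2r(2r-1)\gamma(t)^3 \ge 0.$

In summary, we have proved that under the assumptions (H$_1$) and (H$_2$), the energy function $E(\cdot)$ is nonincreasing. Let us simplify this set of conditions. First observe that

$\ddot{\gamma}(t) + 2(3r-1)\gamma(t)\dot{\gamma}(t) + 2r(2r-1)\gamma(t)^3 = \big(\ddot{\gamma}(t) - 2r^2\gamma(t)^3\big) + 2(3r-1)\gamma(t)\big(\dot{\gamma}(t) + r\gamma(t)^2\big).$

Under condition (H$_1$), we have $3r - 1 \le 0$. Therefore (H$_2$) is satisfied under the two conditions

(H$_3$)   $\ddot{\gamma}(t) \ge 2r^2\gamma(t)^3,$

(H$_4$)   $\dot{\gamma}(t) + r\gamma(t)^2 \le 0.$

So our set of assumptions amounts to (H$_1$)-(H$_3$)-(H$_4$). Let us reduce it by observing that (H$_3$) implies (H$_4$). This follows from an integration procedure. Recall that $\gamma(\cdot)$ is assumed to be nonincreasing (indeed, this is a necessary condition to get (H$_4$)). Therefore $\lim_{t\to+\infty}\gamma(t) = l$ exists with $l \ge 0$. Let us show that necessarily $l = 0$.

Otherwise, by (H$_3$), we would have $\ddot{\gamma}(t) \ge r^2l^3$ for $t$ large enough, say $t \ge T$. By integrating this inequality from $T$ to $t$, we would have $\dot{\gamma}(t) \ge \dot{\gamma}(T) + r^2l^3(t - T)$ for $t \ge T$. This would imply $\lim_{t\to+\infty}\dot{\gamma}(t) = +\infty$, which in turn gives $\lim_{t\to+\infty}\gamma(t) = +\infty$, a clear contradiction with $\gamma(\cdot)$ nonincreasing. So $\lim_{t\to+\infty}\gamma(t) = 0$. Let us multiply (H$_3$) by $\dot{\gamma}(t)$. Since this quantity is less than or equal to zero, we obtain

$\dot{\gamma}(t)\ddot{\gamma}(t) \le 2r^2\gamma(t)^3\dot{\gamma}(t).$

Let us integrate this differential inequality from $t$ to $T \ge t$. We obtain

$\frac{1}{2}\dot{\gamma}(T)^2 - \frac{1}{2}\dot{\gamma}(t)^2 \le \frac{r^2}{2}\big(\gamma(T)^4 - \gamma(t)^4\big).$

Letting $T \to +\infty$, and using $\dot{\gamma}(T)^2 \ge 0$ and $\lim_{T\to+\infty}\gamma(T) = 0$, we obtain $\dot{\gamma}(t)^2 \ge r^2\gamma(t)^4$. Since $\dot{\gamma}(t) \le 0$, this gives $\dot{\gamma}(t) \le -r\gamma(t)^2$, which is (H$_4$).

To summarize, we proved that under the assumptions (H$_1$) and (H$_3$), $\frac{d}{dt}E(t) \le 0$ for $t$ large enough, and hence the function $E(\cdot)$ is nonincreasing. Let us complete this result by showing that, under these assumptions, $E(\cdot)$ is nonnegative. Returning to the definition (5) of $E(\cdot)$, this will result from $\xi(\cdot) \ge 0$. By (10), and since $-2rp(t)^{2r} \le 0$, this property is equivalent to

(13)   $\dot{\gamma}(t) + (4r-1)\gamma(t)^2 \le 0.$

Let us now observe that under condition (H$_1$), that is $r \le \frac{1}{3}$, we have $-r \le 1 - 4r$. Therefore, (13) is implied by assumption (H$_4$), which itself is a consequence of (H$_3$). Going back to the definition of $E(\cdot)$, and because $\xi(\cdot) \ge 0$, we obtain that for $t$ large enough, say $t \ge t_1$,

$p(t)^{2r}\big(\Phi(x(t)) - \min_H\Phi\big) \le E(t) \le E(t_1),$

and (3) follows.

Let us now suppose that the trajectory $x(\cdot)$ is bounded. From $E(t) \le E(t_1)$ for $t \ge t_1$, the definition of $E(\cdot)$ and $\xi(t) \ge 0$, we deduce that

$\big\|\lambda(t)(x(t) - z) + p(t)^{r}\dot{x}(t)\big\|^2 \le 2E(t_1).$

According to the triangle inequality, it follows that

$p(t)^{r}\|\dot{x}(t)\| \le \sqrt{2E(t_1)} + \lambda(t)\|x(t) - z\|.$

Recall that $\lambda(t) = 2r\gamma(t)p(t)^{r}$, as defined in (8). Therefore, the above inequality gives

$\|\dot{x}(t)\| \le \frac{\sqrt{2E(t_1)}}{p(t)^{r}} + 2r\gamma(t)\|x(t) - z\|.$

By (H$_4$), the function $\gamma(\cdot)p(\cdot)^{r}$ is nonincreasing for $t$ large enough, since its derivative equals $p(t)^{r}\big(\dot{\gamma}(t) + r\gamma(t)^2\big) \le 0$. Hence $\gamma(t) = O\big(1/p(t)^{r}\big)$, and, $x(\cdot)$ being bounded, (4) follows.

Remark 2.2. For any $r, r' > 0$ such that $r \ge r'$, condition (H$_{\gamma,r}$) clearly implies (H$_{\gamma,r'}$). Therefore, if (H$_{\gamma,r}$) holds true for some $r \ge 1/3$, then the conclusions of Theorem 2.1 are satisfied with $1/3$ in place of $r$.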
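The Lyapunov argument can be illustrated numerically. The sketch below makes several assumptions that are not taken from the text: it uses the one-dimensional potential $\Phi(x) = x^2/2$ (so $z = 0$ and $\min\Phi = 0$), the damping $\gamma(t) = 2/t$ with $t_0 = 1$ and $r = 1/3$, a hand-written classical Runge-Kutta integrator, and the energy $E(t) = p^{2r}\Phi(x) + \frac{1}{2}(\lambda x + p^{r}\dot{x})^2 + \frac{\xi}{2}x^2$ with $\lambda = 2r\gamma p^{r}$ and $\xi = -2rp^{2r}(\dot{\gamma} + (4r-1)\gamma^2)$. All function names are hypothetical.

```python
ALPHA, R, T0 = 2.0, 1.0 / 3.0, 1.0       # assumed: gamma(t) = ALPHA / t, r = 1/3

def gamma(t):
    return ALPHA / t

def dgamma(t):
    return -ALPHA / t ** 2

def p(t):
    # p(t) = exp(int_{T0}^t gamma) = (t / T0) ** ALPHA for this damping
    return (t / T0) ** ALPHA

def rhs(t, x, v):
    # (IGS) with Phi(x) = x^2 / 2, written as a first-order system
    return v, -gamma(t) * v - x

def rk4_step(t, x, v, h):
    k1x, k1v = rhs(t, x, v)
    k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
    k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
    k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
    return (x + h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            v + h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v))

def lyapunov(t, x, v):
    """Energy E(t) with z = 0 and the parameter choices stated in the lead-in."""
    pr = p(t) ** R
    lam = 2 * R * gamma(t) * pr
    xi = -2 * R * pr ** 2 * (dgamma(t) + (4 * R - 1) * gamma(t) ** 2)
    return pr ** 2 * 0.5 * x * x + 0.5 * (lam * x + pr * v) ** 2 + 0.5 * xi * x * x

def simulate(t_end=50.0, h=1e-3, x0=1.0, v0=0.0):
    """Return the list of (t, x, v) along the numerical trajectory."""
    n = int(round((t_end - T0) / h))
    x, v, out = x0, v0, [(T0, x0, v0)]
    for k in range(n):
        x, v = rk4_step(T0 + k * h, x, v, h)
        out.append((T0 + (k + 1) * h, x, v))
    return out

if __name__ == "__main__":
    traj = simulate()
    energies = [lyapunov(t, x, v) for t, x, v in traj]
    print("E(t0) =", energies[0], " E(t_end) =", energies[-1])
```

With these choices the condition $\ddot{\gamma} \ge 2r^2\gamma^3$ holds for every $t > 0$, so the computed values of $E$ should be nonincreasing up to integration error, and $p(t)^{2r}\Phi(x(t))$ should stay below $E(t_0)$.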
Remark 2.3. In order to verify condition (H$_{\gamma,r}$) in Theorem 2.1, one just needs to consider the asymptotic equivalents of the functions $\gamma(\cdot)$ and $\ddot{\gamma}(\cdot)$. This gives flexibility to this approach. Observe however that the coefficient $r$ may change when considering an equivalent expression. If $\gamma$ satisfies condition (H$_{\gamma,r}$) for some $r > 0$, an equivalent of $\gamma$ may fulfill the condition only for $r' < r$, and vice versa.
Remark 2.5. The important role of the quantity $\frac{\dot{\gamma}(t)}{\gamma(t)^2}$ in the asymptotic analysis of (IGS)$_\gamma$ was already underlined in [2]. In that article, it came in the form $\frac{\dot{\gamma}(t)}{\gamma(t)^2} \ge -c$ with $c \in\, ]0, 1[$. This shows the complementary aspect of our analysis, which is based on the opposite inequality.
Remark 2.6. Under the assumption $\dot{\gamma}(t) + r\gamma(t)^2 \le 0$ for some $r \in\, ]0, 2/3]$, Cabot-Engler-Gadat [12] obtained an estimate for the energy decay.

Optimality of the rate of convergence results. Let us discuss the optimality of the convergence rates (3) and (4) obtained in Theorem 2.1. Optimality is understood in the worst case, which means that, for a given function $\gamma(\cdot)$, we are able to find potential functions $\Phi$ for which equality is reached (or approximated with arbitrary precision) in formulas (3) and (4). Such a discussion for an arbitrary damping function $\gamma(\cdot)$ is difficult because, in general, it is impossible to compute the general solution. In some cases, we can provide special solutions with a computable decay rate, allowing us to conclude that these formulas are optimal. This is the case when $\gamma(t) = \alpha/t$. The corresponding differential equation

(15)   $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0$

plays an important role in optimization, because of its relation with the accelerated method of Nesterov, see [4,26]. We will distinguish the two cases $\alpha \ge 3$ and $\alpha \le 3$, which turn out to be two distinct regimes.

a) The case $\alpha \ge 3$. In this case we have $W(t) = O\left(\frac{1}{t^2}\right)$ as a consequence of Theorem 2.1. See Section 4 for a detailed analysis. Let us show the optimality of this estimate. Following [4, Example 2.13], take $H = \mathbb{R}$ and $\Phi(x) = c|x|^{\delta}$, where $c$ and $\delta$ are positive parameters. Let us look for positive solutions of (15) of the form $x(t) = \frac{1}{t^{\theta}}$, with $\theta > 0$. This means that the trajectory is not oscillating, which corresponds to overdamping (also called heavy damping). We begin by determining the values of $c$, $\delta$ and $\theta$ that provide such solutions. On the one hand, $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) = \theta(\theta + 1 - \alpha)t^{-\theta-2}$. On the other hand, $\nabla\Phi(x(t)) = c\delta\,t^{-\theta(\delta-1)}$. Thus, $x(t) = \frac{1}{t^{\theta}}$ is a solution of (15) if, and only if, i) $\theta + 2 = \theta(\delta - 1)$, which is equivalent to $\delta > 2$ and $\theta = \frac{2}{\delta-2}$; and ii) $c\delta = \theta(\alpha - \theta - 1)$, which is equivalent to $\alpha > \frac{\delta}{\delta-2}$ and $c = \frac{2}{\delta(\delta-2)}\left(\alpha - \frac{\delta}{\delta-2}\right)$. We have $\min\Phi = 0$ and

$W(t) = \frac{1}{2}|\dot{x}(t)|^2 + \Phi(x(t)) = \left(\frac{\theta^2}{2} + c\right)\frac{1}{t^{2\delta/(\delta-2)}}.$

Hence the decay rate of the energy $W(t)$ is of order $\frac{1}{t^{2\delta/(\delta-2)}}$.
As $\delta$ tends to infinity, the exponent $\frac{2\delta}{\delta-2}$ tends to 2 from above, which shows in this case the optimality of the decay rate $W(t) = O\left(\frac{1}{t^2}\right)$ given by Theorem 2.1. This limiting situation is obtained by taking a function $\Phi$ that becomes very flat around the set of its minimizers.
To achieve exactly the decay rate $W(t) \sim \frac{C}{t^2}$, we must take a function $\Phi$ that is extremely flat around its infimum, as given by the inverse of an exponential growth. Following [4, Example 2.11], take $H = \mathbb{R}$ and $\Phi(x) = \frac{\alpha-1}{2}e^{-2x}$ with $\alpha \ge 1$. Let us verify that $x(t) = \ln t$ is a solution of (15). On the one hand, $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) = \frac{\alpha-1}{t^2}$. On the other hand, $\nabla\Phi(x(t)) = -(\alpha-1)e^{-2\ln t} = -\frac{\alpha-1}{t^2}$. Thus, $x(t) = \ln t$ is a solution of (15). Let us examine the decay rate. We have $\inf\Phi = 0$, and

$W(t) = \frac{1}{2t^2} + \frac{\alpha-1}{2t^2} = \frac{\alpha}{2t^2},$

which gives a decay rate of $W(t)$ of order $\frac{1}{t^2}$.

b) The case $\alpha \le 3$. In this case, we have $W(t) = O\left(\frac{1}{t^{2\alpha/3}}\right)$ as a consequence of Theorem 2.1. See Section 4 for a detailed analysis. Let us show the optimality of this estimate. The overdamped solution used in the case $\alpha \ge 3$ no longer works. According to Aujol-Dossal [8], the idea is to take $\Phi(x) = |x|^{\delta}$, and let the exponent $\delta$ tend to 1. For $\delta > 1$, an explicit decay rate is established in [8, Proposition 6]. The optimality of the decay rate is obtained by letting $\delta$ tend to one in that formula. To find a potential function $\Phi$ for which this rate is exactly obtained, we could take directly $\delta = 1$ as in Apidopoulos-Aujol-Dossal [1, Theorem 5.1], but in this case we have to consider a differential inclusion instead of a differential equation, see [3].
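The two explicit solutions used for the case $\alpha \ge 3$ can be checked by direct substitution. The following sketch evaluates the residual of $\ddot{x} + \frac{\alpha}{t}\dot{x} + \nabla\Phi(x)$ and the energy $W$ for $x(t) = \ln t$ with $\Phi(x) = \frac{\alpha-1}{2}e^{-2x}$, and for $x(t) = t^{-\theta}$ with $\Phi(x) = c|x|^{\delta}$. The value $c = \theta(\alpha - \theta - 1)/\delta$ is obtained here by substitution and is an assumption of this sketch, as are the function names.

```python
import math

def log_example(t, alpha):
    """Residual of (15) and energy W for x(t) = ln t, Phi(x) = (alpha-1)/2 * exp(-2x)."""
    x, v, a = math.log(t), 1.0 / t, -1.0 / t ** 2
    grad = -(alpha - 1.0) * math.exp(-2.0 * x)
    residual = a + alpha / t * v + grad
    energy = 0.5 * v * v + 0.5 * (alpha - 1.0) * math.exp(-2.0 * x)
    return residual, energy

def power_example(t, alpha, delta):
    """Residual of (15) and energy W for x(t) = t**(-theta), Phi(x) = c * |x|**delta.

    Requires delta > 2 and alpha > delta / (delta - 2) so that c > 0.
    """
    theta = 2.0 / (delta - 2.0)
    c = theta * (alpha - theta - 1.0) / delta      # derived by substitution
    x = t ** (-theta)
    v = -theta * t ** (-theta - 1.0)
    a = theta * (theta + 1.0) * t ** (-theta - 2.0)
    grad = c * delta * x ** (delta - 1.0)          # x > 0, so |x| = x
    residual = a + alpha / t * v + grad
    energy = 0.5 * v * v + c * x ** delta
    return residual, energy

if __name__ == "__main__":
    print(log_example(5.0, 3.0))         # residual near 0, energy alpha / (2 t^2)
    print(power_example(2.0, 4.0, 4.0))  # residual near 0, energy of order t^(-4)
```

In the second example with $\delta = 4$, the energy decays like $t^{-2\delta/(\delta-2)} = t^{-4}$; larger values of $\delta$ bring the exponent closer to 2.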
Remark 2.7. For clear numerical reasons, the multiplicative coefficient involved in the convergence rate also plays an important role. Its optimization is the subject of an active research trend, see Drori-Teboulle [16] and references therein.

Some other examples.
Let us now give some other examples for which we are able to determine explicitly the rate of convergence of the energy. Taking $\Phi \equiv 0$, we obtain $\dot{x}(t) = \dot{x}(t_0)\,e^{-\int_{t_0}^{t}\gamma(s)\,ds}$, see the introduction.
We then have, for every $t \ge t_0$,

$W(t) = \frac{1}{2}\|\dot{x}(t_0)\|^2 e^{-2\int_{t_0}^{t}\gamma(s)\,ds} = W(t_0)\,e^{-2\int_{t_0}^{t}\gamma(s)\,ds}.$

The above rate of decay is the best that can be expected for the energy function $W(t)$, as shown by the next result.
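Proposition 2.1. Let $\Phi : H \to \mathbb{R}$ be a function of class $C^1$ such that $\operatorname{argmin}\Phi \ne \emptyset$. Let $\gamma : [t_0, +\infty[ \to \mathbb{R}_+$ be a continuous function. Then any solution $x$ of (IGS)$_\gamma$ satisfies, for every $t \ge t_0$,

(17)   $W(t) \ge W(t_0)\,e^{-2\int_{t_0}^{t}\gamma(s)\,ds}.$

Proof. Taking into account the expressions of $W$ and $\dot{W}$, we have

$\dot{W}(t) = -\gamma(t)\|\dot{x}(t)\|^2 \ge -2\gamma(t)W(t).$

Let us multiply the above inequality by $e^{2\int_{t_0}^{t}\gamma(s)\,ds}$. We obtain $\frac{d}{dt}\Big[e^{2\int_{t_0}^{t}\gamma(s)\,ds}\,W(t)\Big] \ge 0$. Formula (17) immediately follows.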
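The lower bound on the energy can be observed on a simple example. The sketch below assumes $\Phi(x) = x^2/2$ on $\mathbb{R}$, the damping $\gamma(t) = 3/t$ with $t_0 = 1$ (so that $e^{-2\int_{t_0}^{t}\gamma} = (t_0/t)^{6}$), and a hand-written classical Runge-Kutta integrator; none of these choices comes from the text, and the names are hypothetical.

```python
ALPHA, T0 = 3.0, 1.0                      # assumed damping gamma(t) = ALPHA / t

def rhs(t, x, v):
    # (IGS) with Phi(x) = x^2 / 2, written as a first-order system
    return v, -(ALPHA / t) * v - x

def rk4_step(t, x, v, h):
    k1x, k1v = rhs(t, x, v)
    k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
    k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
    k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
    return (x + h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            v + h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v))

def energy(x, v):
    """Global energy W = |x'|^2 / 2 + Phi(x) - min Phi, with min Phi = 0."""
    return 0.5 * v * v + 0.5 * x * x

def lower_bound(t, w0):
    """W(t0) * exp(-2 * int_{t0}^t gamma) = W(t0) * (t0 / t)**(2 * ALPHA)."""
    return w0 * (T0 / t) ** (2 * ALPHA)

def energy_vs_bound(t_end=20.0, h=1e-3, x0=1.0, v0=0.0):
    """Return the list of (t, W(t), lower bound) along the numerical trajectory."""
    n = int(round((t_end - T0) / h))
    x, v, w0 = x0, v0, energy(x0, v0)
    out = [(T0, w0, w0)]
    for k in range(n):
        x, v = rk4_step(T0 + k * h, x, v, h)
        t = T0 + (k + 1) * h
        out.append((t, energy(x, v), lower_bound(t, w0)))
    return out

if __name__ == "__main__":
    t, w, b = energy_vs_bound()[-1]
    print(f"t = {t:.1f}: W = {w:.3e} >= bound = {b:.3e}")
```

For this strongly convex potential the energy stays far above the bound, which is only attained in the degenerate case $\Phi \equiv 0$.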
Assuming now that $\Phi = \frac{1}{2}\|\cdot\|^2$, the next proposition gives an equivalent of the energy $W(t)$ as $t \to +\infty$.

Proposition 2.2. Assume that $\Phi = \frac{1}{2}\|\cdot\|^2$ and that $\gamma : [t_0, +\infty[ \to \mathbb{R}_+$ is a function of class $C^1$ such that $\lim_{t\to+\infty}\gamma(t) = 0$ and $\dot{\gamma} \in L^1(t_0, +\infty)$. Let $x$ be a solution of equation (IGS)$_\gamma$. Then, either the solution $x$ is stationary and equal to zero, or there exists $C > 0$ such that

$W(t) \sim C\,e^{-\int_{t_0}^{t}\gamma(s)\,ds}$ as $t \to +\infty$.

This result was obtained by Cabot-Frankel [14, Lemma 2.2] in the framework of linear hyperbolic evolution equations. For the sake of completeness, a self-contained proof is provided in the Appendix.
The above examples suggest that the speed of convergence of $W(t)$ as $t \to +\infty$ depends on the "index" of convexity of the function $\Phi$. This can be quantified as follows. Let $\theta > 0$ and let $\Phi : H \to \mathbb{R}$ be a function such that $\operatorname{argmin}\Phi \ne \emptyset$. The function $\Phi$ is said to be $\theta$-power convex if $(\Phi - \min_H\Phi)^{\theta}$ is convex. For $\theta = 1$, we recover the classical notion of convexity. Observe that $\Phi = \frac{1}{2}\|\cdot\|^2$ is $\frac{1}{2}$-power convex, while $\Phi = 0$ corresponds to $\theta \to 0$. For $\theta > 1$, the class of $\theta$-power convex functions goes beyond convexity. The class of $\theta$-power convex functions was handled by Cabot-Engler-Gadat [12, Section 3] in connection with the dynamical system (IGS)$_\gamma$. More recently, the notion of $\theta$-power convexity was used by Aujol-Dossal in the case $\gamma(t) = \alpha/t$. These authors quantified the rate of convergence of the energy $W(t)$ as a function of $\alpha$ and $\theta$, see [8, Theorem 1]. This suggests extending Theorem 2.1 to the framework of $\theta$-power convex functions. This is beyond the scope of the present paper, but is certainly a matter for future investigation.
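3. Rate of decay of the energy: case $0 < r < \frac{1}{3}$

Let us complete the preceding results by analyzing the decay rate of the global energy in the case $0 < r < \frac{1}{3}$. We will get a better convergence rate, by passing from $O$ estimates to $o$ estimates. Recall that the global energy function $t \mapsto W(t) \in \mathbb{R}_+$, which is the sum of the potential energy and the kinetic energy, is written

$W(t) = \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min_H\Phi.$

Our proof relies on establishing integral estimates for velocities and values. For this, we return to the basic estimate (7), which we recall below:

(18)   $\frac{d}{dt}E(t) \le \big(2r\gamma(t)p(t)^{2r} - \lambda(t)p(t)^{r}\big)\big(\Phi(x(t)) - \min_H\Phi\big) + \big(\xi(t) + \lambda(t)^2 + (r-1)\gamma(t)p(t)^{r}\lambda(t) + p(t)^{r}\dot{\lambda}(t)\big)\langle x(t) - z, \dot{x}(t)\rangle + p(t)^{r}\big(\lambda(t) + (r-1)\gamma(t)p(t)^{r}\big)\|\dot{x}(t)\|^2 + \Big(\lambda(t)\dot{\lambda}(t) + \frac{\dot{\xi}(t)}{2}\Big)\|x(t) - z\|^2.$

Recall that this inequality is valid regardless of the choice of parameters. It simply uses the convex subdifferential inequality. Precisely, we will obtain these estimates by making different choices for the parameters $\lambda(\cdot)$ and $\xi(\cdot)$.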

3.1. Integral estimate of the velocities. Following the proof of Theorem 2.1, let us take in (18) the parameters $\lambda(t)$ and $\xi(t)$ respectively given by (8) and (9). So we have

(19)   $\lambda(t) = 2r\gamma(t)p(t)^{r}, \qquad \xi(t) = -2rp(t)^{2r}\big(\dot{\gamma}(t) + (4r-1)\gamma(t)^2\big).$

With this precise choice of the parameters, the first two terms of the right-hand side of (18) are equal to zero. We showed that, under condition (H$_{\gamma,r}$) with $r \in\, ]0, 1/3]$, the last term of (18) is less than or equal to zero. So, (18) reduces to

(20)   $\frac{d}{dt}E(t) + (1 - 3r)\gamma(t)p(t)^{2r}\|\dot{x}(t)\|^2 \le 0.$

This inequality brings information when $0 < r < \frac{1}{3}$. As a direct consequence of (20) we obtain the following result.
The following table gives a synthetic summary of the speed of convergence of the energy $W(t)$ as $t \to +\infty$. Recall that $W(t) \to 0$ reflects both the rate of convergence to zero of the value $\Phi(x(t)) - \min_H\Phi$ and of the velocity $\|\dot{x}(t)\|$.
Corollary 4.1. Given $\alpha > 0$, assume that $\gamma(t) = \alpha/t$ for every $t \ge t_0 > 0$. Let $x(\cdot)$ be a solution of (IGS)$_\gamma$. Then, depending on whether $\alpha$ is lower or higher than 3, we have the following results:
a) Case $\alpha \le 3$: if the trajectory $x(\cdot)$ is bounded, then $W(t) = O\left(\frac{1}{t^{2\alpha/3}}\right)$ as $t \to +\infty$.
b) Case $\alpha > 3$: $W(t) = o\left(\frac{1}{t^2}\right)$ as $t \to +\infty$.

Thus, we get a unified picture of the convergence properties for (IGS)$_\gamma$ in the case $\gamma(t) = \alpha/t$, $\alpha > 0$. The cases $\alpha > 3$ and $\alpha \le 3$ are clearly asymmetric. When $\alpha > 3$, the trajectories of (IGS)$_\gamma$ converge weakly and hence are bounded, see Theorem 1.1. On the other hand, when $\alpha \le 3$, neither the convergence of the trajectories nor their boundedness is known. This is why the boundedness assumption is required in this case, see Corollary 4.1 a). We emphasize that historically the case $\alpha \ge 3$ was solved first. It is only recently that the case $0 < \alpha \le 3$ has been well understood.
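For the damping $\gamma(t) = \alpha/t$, the decay exponent can be computed directly from the parameter $r$. The helper below is a small sketch, assuming that the admissible values of $r$ are those with $r \le 1/3$ and $r \le 1/\alpha$ (the latter obtained by inserting $\gamma(t) = \alpha/t$ into (H$_{\gamma,r}$)), and that the energy bound has the form $O(1/p(t)^{2r})$ with $p(t) = (t/t_0)^{\alpha}$; the function names are hypothetical.

```python
def critical_r(alpha):
    """Largest r in ]0, 1/3] with r <= 1/alpha (assumed admissibility for gamma = alpha/t)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return min(1.0 / 3.0, 1.0 / alpha)

def energy_decay_exponent(alpha):
    """Exponent s such that 1 / p(t)**(2r) is of order t**(-s), with p(t) = (t/t0)**alpha."""
    return 2.0 * alpha * critical_r(alpha)

if __name__ == "__main__":
    for alpha in (1.0, 1.5, 3.0, 6.0):
        print(alpha, energy_decay_exponent(alpha))
```

The exponent equals $2\alpha/3$ in the subcritical range $\alpha \le 3$ and saturates at 2 for $\alpha \ge 3$, which matches the two regimes discussed in the text.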
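Remark 4.1. In [2] the authors give a complete picture in the case $\gamma(t) \ge 3/t$. In particular, this study covers the case $\gamma(t) = c/t^{\theta}$ for $0 < \theta < 1$, which models a slow decrease to zero of the damping function. Our approach complements this study and covers the "subcritical case" $\gamma(t) \le 3/t$. It might be tempting to look for a unifying analysis including these two cases. Each of them relies on a different Lyapunov function. We believe that they correspond to different regimes, and that a unifying theory would be too complicated technically, and therefore would not offer decisive progress.

For $\gamma(t) = \frac{1}{t(\ln t)^{\beta}}$ with $0 \le \beta \le 1$, we have to distinguish two cases:
a) $\beta = 1$. In this case, $p(t) = \frac{\ln t}{\ln t_0}$, and we get from Theorem 2.1 applied with $r = \frac{1}{3}$ that $\Phi(x(t)) - \min_H\Phi = O\left(\frac{1}{(\ln t)^{2/3}}\right)$ as $t \to +\infty$.
b) $0 \le \beta < 1$. In this case, $p(t) = \exp\left(\frac{(\ln t)^{1-\beta} - (\ln t_0)^{1-\beta}}{1-\beta}\right)$, and Theorem 2.1 applied with $r = \frac{1}{3}$ gives $\Phi(x(t)) - \min_H\Phi = O\left(\frac{1}{p(t)^{2/3}}\right)$ as $t \to +\infty$.
Let us observe that when $\beta \to 0$ or $\beta \to 1$, we recover the previous formulae.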
We can now complete the table giving a synthetic view about the speed of convergence in the different situations studied above.

The perturbed case
We consider the equation

$\ddot{x}(t) + \gamma(t)\dot{x}(t) + \nabla\Phi(x(t)) = g(t)$

with a second member $g : [t_0, +\infty[ \to H$. Depending on the situation, $g$ is a forcing term, or comes from approximation or computational errors in (IGS)$_\gamma$. To ensure existence and uniqueness for the corresponding Cauchy problem, we suppose that $g$ is locally integrable. We will show that the results of the previous sections remain valid if the perturbation $g$ is not too large asymptotically. This reflects some structural stability of the dynamics (IGS)$_\gamma$.

Note that the term $\ddot{x}(t) - g(t) = -\big(\gamma(t)\dot{x}(t) + \nabla\Phi(x(t))\big)$ gives exactly the same expression as in the unperturbed case. Therefore, by a similar argument, we obtain that, under the assumption (H$_{\gamma,r}$) with $r \in\, ]0, 1/3]$, $\frac{d}{dt}E(t) \le 0$ for $t \ge t_1$. As a result, the energy function $E(\cdot)$ is nonincreasing. In particular, for $t \ge t_1$, we have $E(t) \le E(t_1)$, which gives, by definition of $E(\cdot)$, estimate (25).

Recalling that $\lim_{t\to+\infty}\gamma(t) = 0$, the expression of $W$ shows that the equivalence (28) holds as $t \to +\infty$.
We deduce from (26), (27) and (28) the existence of $t_1 \ge t_0$ such that inequality (29) holds for every $t \ge t_1$. Observe that if $F(t_2) = 0$ for some $t_2 \ge t_1$, then we have $W(t_2) = 0$. Since the map $W$ is nonincreasing, we conclude that $W(t) = 0$ for every $t \ge t_2$, i.e. the solution $x$ is stationary. Now assume that $F(t) > 0$ for every $t \ge t_1$ and divide each member of inequality (29) by $F(t)$. Since $\dot{\gamma} \in L^1(t_0, +\infty)$ by assumption, it follows that $\lim_{t\to+\infty}\ln F(t)$ exists in $\mathbb{R}$. We deduce that $\lim_{t\to+\infty} e^{\int_{t_1}^{t}\gamma(s)\,ds}\,W(t) = K > 0$. The conclusion immediately follows from estimate (28). The use of such an auxiliary function is classical, see for example [18, Lemma 3.2.6] in the case of an autonomous damping.

Lyapunov function. At this point $z \in \operatorname{argmin}\Phi$ is fixed, while $r$ is a positive real number, and $\lambda(\cdot)$, $\xi(\cdot)$ are positive functions that we will use as parameters. The key to the proof is to choose these parameters wisely in order to make $E_{r,\lambda,\xi}(\cdot)$ a nonincreasing function. For brevity, and without ambiguity, we will write $E$ instead of $E_{r,\lambda,\xi}$.