Convergence of the gradient method for ill-posed problems

We study the convergence of the gradient descent method for solving ill-posed problems where the solution is characterized as a global minimum of a differentiable functional in a Hilbert space. The classical least-squares functional for nonlinear operator equations is a special instance of this framework and the gradient method then reduces to Landweber iteration. The main result of this article is a proof of weak and strong convergence under new nonlinearity conditions that generalize the classical tangential cone conditions.


Introduction
A widely used approach for dealing with a nonlinear ill-posed problem is to phrase it as an operator equation in Banach or Hilbert spaces and apply an iterative regularization method for its solution [8]. The simplest, though not the fastest, among them is the Landweber iteration, which can be viewed as a gradient descent method for the associated least-squares functional. A well-known convergence theory for Landweber iteration for nonlinear ill-posed problems has been established, based on the seminal paper by Hanke, Neubauer, and Scherzer [11]. The pivotal innovation that paved the way for the analysis was to restrict the "nonlinearity" of the problem by imposing so-called nonlinearity conditions on the underlying operators.
Such conditions have been verified for several important nonlinear ill-posed problems, e.g., for parameter identification in partial differential equations using interior measurements; see, e.g., [17]. However, and this is the crucial point, they have not yet been verified for certain well-studied problems like electrical impedance tomography (also known as Calderón's problem) [7], although Landweber and other iterative methods have been successfully applied to them. This might hint that the traditionally used nonlinearity conditions are too strong to be satisfied for certain applications, and one may try to replace them by weaker assumptions.
The main goal of this paper is to prove (local) weak and strong convergence of a subsequence of the gradient descent iterates for a functional with Lipschitz-continuous gradient, imposing more general nonlinearity conditions than the usual ones.
Reviewing such typically used restrictions reveals that the most common ones are the weak and strong forms of the tangential cone condition [11,22,24]. Stronger than those are the range-invariant conditions [11,18]. Conceptually similar to this is the approach via Hilbert scales [18,19]. A typical functional-based nonlinearity condition is the assumption that the underlying functional is locally convex, or equivalently, that its gradient is monotone [6,3,21,13,14], or strongly monotone [20,4]. Except in the strongly monotone case, the assumption of a monotone gradient is not enough to prove strong convergence for the classical Landweber iteration. (Note that in [21,13,14] a continuous version was investigated while in [4,6,3] a modified (i.e., regularized) Landweber iteration was considered.) An insightful comparison of tangential cone conditions and several versions of monotonicity of the gradient can be found in [22].
One of the main contributions of this article is to prove boundedness and weak convergence of the gradient descent iterations essentially under a two-parametric nonlinearity condition, which includes both the weak and strong tangential cone conditions and several convexity conditions as special cases. This is interesting insofar as the tangential cone conditions do not imply convexity of the associated least-squares functional; thus our analysis can be viewed as an attempt at a unification of the established nonlinearity restrictions.
We also prove strong convergence of the iterates under a novel restriction which requires the functional to be "balanced" around critical points. This can be seen as a generalization of the strong tangential cone condition. All these results hold both for the exact data-case and the noisy data-case, where in the latter, we employ a simple a-priori parameter choice.
Our setup is phrased as the problem of finding minima of a general ill-posed functional rather than in the form of nonlinear operator equations, but since one can of course apply the least-squares idea, the classical Landweber iteration is a special instance of the gradient iteration studied here. Note that the Lipschitz continuity of the gradient plays an essential role in our work; hence, certain Banach-space variants of Landweber iteration (see, e.g., [12,23]) in non-smooth spaces are not within the scope of this work.
Our paper is organized as follows: in Section 2 we define the gradient iteration, present the standard assumptions we use and the novel nonlinearity conditions we impose. We study them in detail by relating them to the traditionally used ones. In Section 3, we prove boundedness and weak convergence of the iterates to a stationary point for our setup both in the case of exact and noisy data. In Section 4, strong convergence of the iteration is proven in a similar framework.

Setup and nonlinearity conditions
We consider the problem of finding a solution of an ill-posed problem that is characterized as a global minimum of a certain functional. Throughout this paper, we denote by $B_\rho(x^*)$ a ball with center $x^*$ and radius $\rho$ in a Hilbert space. We assume given a Fréchet-differentiable, nonnegative functional
$$J : B_\rho(x^*) \subset X \to \mathbb{R}, \qquad J \geq 0, \tag{1}$$
where $X$ is a Hilbert space, and that a sought-for solution $x^*$ satisfies
$$J(x^*) = 0. \tag{2}$$
By the nonnegativity of $J$, $x^*$ is a global minimum and has to satisfy the first-order optimality condition $\nabla J(x^*) = 0$.
The most important instance of such a functional is the least-squares functional $J_{LS}(x)$ for a nonlinear operator equation $F(x) = y$ with given data $y$, which is defined as
$$J_{LS}(x) := \tfrac{1}{2}\|F(x) - y\|^2. \tag{5}$$
In the setup of (1), we assumed that the given data are encoded somehow into the functional $J$. Similar to the least-squares case, we have to allow for inexact data as well, i.e., we have to consider a "noisy" version of $J$ that represents the actual measurements. Thus, we assume given a Fréchet-differentiable, nonnegative functional $J^\delta$ on $B_\rho(x^*)$, on which the actual iteration is based. In order to solve (2) for $x^*$ with given noisy data, a gradient iteration can be used. It is defined iteratively (as long as the iterates $x^\delta_k$ stay in $B_\rho(x^*)$) as
$$x^\delta_{k+1} = x^\delta_k - \nabla J^\delta(x^\delta_k), \tag{7}$$
starting with an initial guess $x^\delta_0 \in B_\rho(x^*)$. For the analysis, it is convenient to define the corresponding iteration with exact data as well,
$$x_{k+1} = x_k - \nabla J(x_k), \tag{8}$$
starting with the same initial guess $x_0 \in B_\rho(x^*)$ as in (7). In the least-squares case (5), iteration (7), respectively (8), is the classical Landweber iteration:
$$x^\delta_{k+1} = x^\delta_k - F'(x^\delta_k)^*\big(F(x^\delta_k) - y^\delta\big), \qquad x_{k+1} = x_k - F'(x_k)^*\big(F(x_k) - y\big).$$
Note that gradient descent iterations usually use a stepsize parameter in front of the gradient term. We assume throughout a constant stepsize parameter that is encoded into the functional $J$, so that it is set to 1. The only restriction on the stepsize comes from the assumptions that we impose on $J$ and $J^\delta$. Essentially, we assume that $J$ and $J^\delta$ are differentiable on $B_\rho(x^*)$ with Lipschitz-continuous derivative and Lipschitz constant smaller than 1. Precisely, we postulate the following:
1. $X$ is a Hilbert space.
3. For some $\rho > 0$, $J$ and $J^\delta$ are defined on $B_\rho(x^*) \subset X$ and are Fréchet-differentiable there.
As the notation suggests, $\delta$ plays the role of the noise level. We note that (10) implies that $\nabla J^\delta$ is Lipschitz continuous with Lipschitz constant $L_\delta$. It is easy to observe that for the least-squares problem (5) and Landweber iteration, these assumptions are satisfied if $F$ has a Lipschitz-continuous derivative and if the stepsize in Landweber iteration is chosen sufficiently small. The noise level $\delta$ according to (13) is then related to the usual one by $\delta \geq \sup_{x \in B_\rho(x^*)} \|F'(x)\| \, \|y^\delta - y\|$. Since for the least-squares case we have $\nabla J^\delta(x) = F'(x)^*(F(x) - y^\delta)$, the inequality (12) holds with $\varphi(s) \sim s$.
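As a minimal sketch of the gradient iteration in the least-squares case, the following Python snippet runs Landweber iteration for a small linear operator with the stepsize folded into the functional, i.e. $J(x) = \frac{\omega}{2}\|Ax - y\|^2$ with $\omega\|A\|^2 < 1$. The matrix, the data, the stepsize `omega`, and all function names are illustrative assumptions and are not taken from the analysis above.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def landweber(A, y, x0, omega, n_iter):
    """Gradient iteration x_{k+1} = x_k - grad J(x_k) for
    J(x) = (omega / 2) * ||A x - y||^2, so grad J(x) = omega * A^T (A x - y).

    omega is the (hypothetical) stepsize absorbed into J; it should satisfy
    omega * ||A||^2 < 1 so that grad J has Lipschitz constant below 1.
    """
    At = transpose(A)
    x = list(x0)
    for _ in range(n_iter):
        residual = [fi - yi for fi, yi in zip(matvec(A, x), y)]
        grad = [omega * g for g in matvec(At, residual)]
        x = [xi - gi for xi, gi in zip(x, grad)]
    return x


# Illustrative data: a mildly ill-conditioned diagonal operator.
A = [[1.0, 0.0], [0.0, 0.1]]
x_true = [1.0, 2.0]
y = matvec(A, x_true)
x_rec = landweber(A, y, x0=[0.0, 0.0], omega=0.9, n_iter=2000)
```

In this toy example the component belonging to the small singular value 0.1 contracts only by the factor $1 - 0.9 \cdot 0.01$ per step, which is why many iterations are needed.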

Nonlinearity conditions
We now propose a two-parametric generalization of the well-known weak tangential cone condition:
Definition 2.2. For some $\gamma \in [0, \infty]$ and $\beta \in \mathbb{R}$, we say that NC(γ,β) is satisfied for $J$ if for all $x_1, x_2$ in $B_\rho(x^*)$ the following implication holds true:
$$J(x_1) \leq \gamma J(x_2) \quad \Longrightarrow \quad \langle \nabla J(x_2), x_2 - x_1 \rangle \geq -\beta \|\nabla J(x_2)\|^2. \tag{14}$$
Note that we allow $\gamma = 0$ and $\gamma = \infty$. In the latter case, the premise in the implication is tautological, thus the conclusion has to hold for all $x_1, x_2 \in B_\rho(x^*)$, while for $\gamma = 0$, the conclusion has to hold only for $x_1$ at a global minimum.
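To make the role of the two parameters concrete, the following sketch tests the implication on a finite sample of points for a one-dimensional functional. It assumes that the implication in Definition 2.2 reads $J(x_1) \le \gamma J(x_2) \Rightarrow \langle \nabla J(x_2), x_2 - x_1\rangle \ge -\beta\|\nabla J(x_2)\|^2$, which is the form in which (14) is applied in the error estimate of Section 3. The functional $J(x) = x^2/4$ and the helper name `nc_holds` are hypothetical.

```python
import math


def nc_holds(J, grad_J, points, gamma, beta):
    """Check NC(gamma, beta) on all pairs from a finite sample of a real interval.

    This is only a sampled check, not a proof: it can refute the condition
    but cannot establish it on the whole ball.
    """
    for x1 in points:
        for x2 in points:
            # For gamma = inf the premise is treated as always true.
            premise = math.isinf(gamma) or J(x1) <= gamma * J(x2)
            if not premise:
                continue
            g = grad_J(x2)
            if g * (x2 - x1) < -beta * g * g:
                return False
    return True


# Illustrative functional with minimum x* = 0 and gradient Lipschitz constant 1/2.
J = lambda x: 0.25 * x * x
grad_J = lambda x: 0.5 * x
grid = [i / 10 for i in range(-10, 11)]  # sample of the ball B_1(0)

quasiconvex_case = nc_holds(J, grad_J, grid, gamma=1.0, beta=0.0)
negative_beta_case = nc_holds(J, grad_J, grid, gamma=0.0, beta=-1.0)
all_pairs_case = nc_holds(J, grad_J, grid, gamma=math.inf, beta=0.0)
```

For this convex example the sampled check passes for NC(1,0) and NC(0,−1) but fails for NC(∞,0), which illustrates that a larger γ is the more demanding requirement.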
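It is easy to verify that the condition in Definition 2.2 is the stronger the larger $\gamma$ and the smaller $\beta$ is:
$$\text{NC}(\gamma_1,\beta_1) \ \Longrightarrow\ \text{NC}(\gamma_2,\beta_2) \qquad \text{for } \gamma_2 \leq \gamma_1, \ \beta_2 \geq \beta_1.$$
This condition can be compared to the weak tangential cone condition (or $x^*$-quasi-uniform monotonicity [22]) in the least-squares case: there exists an $0 < \eta < 1$ such that
$$\langle F(x) - F(x^*) - F'(x)(x - x^*), F(x) - F(x^*) \rangle \leq \eta \|F(x) - F(x^*)\|^2 \qquad \forall x \in B_\rho(x^*), \tag{15}$$
or, equivalently,
$$\langle F'(x)^*(F(x) - F(x^*)), x - x^* \rangle \geq (1 - \eta) \|F(x) - F(x^*)\|^2 \qquad \forall x \in B_\rho(x^*).$$
It is easy to see that for Fréchet-differentiable $F$, (15) with $\eta \in (0, 1)$ implies (14) with a negative $\beta$ and $\gamma = 0$ for the associated least-squares functional. It was shown in [22] that (15) with $\eta \in (0, 1)$ implies weak convergence for the Landweber iteration with exact data. In [24], the condition NC(0,β) with $\beta < 0$ was imposed and again weak convergence of the exact Landweber iteration was proven (and also strong convergence for a modified form of the iteration).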
We generalize these results insofar as we also verify weak convergence in the noisy case and, more interestingly, we prove that (14) with γ = 0 and any β ∈ R (also positive ones, and in particular (15) with η = 1) already implies weak convergence of a subsequence of the gradient iteration.
Strong convergence of gradient iterations requires a stronger nonlinearity condition than the previous ones. For our convergence analysis, we need in addition to (14) the following condition, which we call the balancing condition.
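Definition 2.3. We say that $J$ is $\gamma$-balanced if for some $\rho_0 < \rho$ and any sequence $z_n$ with $\rho_0 \leq \|z_n\| \leq \rho$ there exist a $\tau > 0$ and an $n_0 \in \mathbb{N}$ such that
$$J(x^* - \tau z_n) \leq \gamma J(x^* + z_n) \qquad \text{for all } n \geq n_0. \tag{17}$$
We will prove strong convergence of the gradient iteration under the condition that $J$ is $\gamma$-balanced and satisfies NC(γ,β) for some $\gamma \geq 0$ and some $\beta \in \mathbb{R}$.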
The condition in Definition 2.3 can sloppily be interpreted as the requirement that J does not have extremely large values when evaluated at a mirror point around x * . Thus, the functional should roughly behave in a similar way left and right at x * on a line through x * .
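The mirror-point interpretation can be sketched numerically. The snippet below assumes that the balancing inequality has the form $J(x^* - \tau z) \le \gamma J(x^* + z)$, which is how (17) is used in the proof of Lemma 4.2, and checks it on sampled offsets $z$ for a hypothetical one-dimensional functional that is four times steeper to the left of $x^* = 0$ than to the right. The functional, the sample, and the helper name are illustrative assumptions.

```python
def is_balanced_on_sample(J, x_star, zs, tau, gamma):
    """Check J(x* - tau*z) <= gamma * J(x* + z) for every sampled offset z."""
    return all(J(x_star - tau * z) <= gamma * J(x_star + z) for z in zs)


def J_asym(x):
    """Hypothetical asymmetric functional: steeper to the left of x* = 0."""
    return 0.1 * x * x if x >= 0 else 0.4 * x * x


# Offsets with rho_0 = 0.2 <= |z| <= rho = 1 on both sides of x*.
offsets = [s * k / 10 for k in range(2, 11) for s in (1, -1)]

ok_small_tau = is_balanced_on_sample(J_asym, 0.0, offsets, tau=0.4, gamma=1.0)
ok_mirror = is_balanced_on_sample(J_asym, 0.0, offsets, tau=1.0, gamma=1.0)
```

With γ = 1 the exact mirror point (τ = 1) violates the inequality on the steep side, while the shrunken mirror point (τ = 0.4) satisfies it on the whole sample; the freedom in choosing τ is what allows a moderate asymmetry.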
It is easy to verify that if J is convex on B ρ (x * ) and satisfies a symmetry condition with a constant C, then (17) holds. Traditionally, strong convergence of the Landweber iteration is verified under the so-called (strong) tangential cone condition (or strong Scherzer condition) (see, e.g., [11,8,17,22]): there exists 0 < η < 1 such that It is obvious that the strong tangential cone condition implies the weak one.
There are several interesting conclusions that follow from (18). For instance, the following useful estimate follows immediately from (18): Instead of (18), an even stronger condition is sometimes imposed: it postulates the existence of a family of operators $R_x$ such that Locally, it follows that the $R_x$ are invertible operators, which allows one to compare the derivatives at different points $x$. It follows that this condition implies (18) (possibly on a smaller ball).
Let us briefly study the relations of the conditions NC(γ,β) and Definition 2.3 with both the tangential cone conditions and certain convexity conditions. At first we introduce a generalization of quasiconvexity:
Definition 2.4. Let $C$ be a convex set and let $0 \leq \gamma \leq 1$. We say that the functional $f$ is $\gamma$-quasiconvex if for all $x, y \in C$ and $\lambda \in (0, 1)$ we have
$$\gamma f(\lambda x + (1 - \lambda) y) \leq \max\{f(x), f(y)\}.$$
For $\gamma = 1$, we encounter the traditional definition of quasiconvexity [2], which might also be phrased as the condition that the following inequality holds: For positive functionals $f$, the assumption of $\gamma$-quasiconvexity is weaker than quasiconvexity, which itself is in any case weaker than convexity. In terms of level sets, it is easy to see that $f$ is quasiconvex if and only if all its lower level sets $\{f \leq \alpha\}$ are convex. We may view $\gamma$-quasiconvexity as the condition that the convex hull of $\{f \leq \gamma\alpha\}$ does not intersect the complement of $\{f \leq \alpha\}$.
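The difference between quasiconvexity and its relaxed version can be illustrated by a sampled check. The sketch assumes that γ-quasiconvexity means $\gamma f(\lambda x + (1-\lambda)y) \le \max\{f(x), f(y)\}$, which is the inequality consistent with the level-set description above; the test functional with a bump at the origin, the grids, and the helper name are hypothetical.

```python
import math


def is_gamma_quasiconvex(f, points, gamma, lambdas, tol=1e-12):
    """Sampled test of gamma * f(lam*x + (1-lam)*y) <= max(f(x), f(y)).

    tol absorbs floating-point rounding in the convex combination.
    """
    for x in points:
        for y in points:
            bound = max(f(x), f(y)) + tol
            for lam in lambdas:
                z = lam * x + (1.0 - lam) * y
                if gamma * f(z) > bound:
                    return False
    return True


# Hypothetical positive functional on [-1, 1] with a bump at 0 (values in [1, 3]).
bump = lambda x: 2.0 + math.cos(math.pi * x)
grid = [i / 10 for i in range(-10, 11)]
lams = [j / 10 for j in range(1, 10)]

quasi = is_gamma_quasiconvex(bump, grid, 1.0, lams)    # classical quasiconvexity
relaxed = is_gamma_quasiconvex(bump, grid, 0.3, lams)  # gamma-quasiconvexity
```

The bump makes the lower level set $\{f \le 1\}$ non-convex, so the check with γ = 1 fails, whereas with γ = 0.3 the bump is small enough relative to the values at the endpoints and the sampled check passes.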
It is interesting that the weak tangential cone condition can also be expressed by a derivative-free condition:
Proposition 2.6. Let $F$ be Fréchet-differentiable. Then, condition (15) holds for some $\eta \in [0, 1]$ if and only if the least-squares functional $J_{LS}(x) := \frac{1}{2}\|F(x) - F(x^*)\|^2$ has the property that the mapping is monotonically increasing for $t \in [0, 1]$ for all $x \in B_\rho(x^*)$.
Next, we verify that the strong tangential cone condition implies NC(γ,β) with appropriate parameter values.
Lemma 2.7. Let the tangential cone condition (18) hold with η < 1. Then the Proof. By expanding the terms, we verify that (18) is equivalent to the inequality for all x, z ∈ B ρ (x * ). Assume that J(x 1 ) ≤ γJ(x 2 ). Using (18) with x = x 2 , z = x 1 , Young's inequality, and the triangle inequality, we obtain This lemma justifies our claim that NC(γ,β) is a generalization of the tangential cone condition. Note, however, that quasiconvexity (i.e., NC(1,0)) or even convexity of the least-squares functional is not implied by the tangential cone conditions, while by our results, strong convergence holds for quasiconvex (and convex) functionals if the balancing condition is additionally satisfied.
Concerning the balancing condition, it can be shown that the least-squares functional is γ-balanced if F satisfies the strong tangential cone condition.

Proof. Indeed it follows from (19) that
We note that for convergence of the Landweber iteration, the tangential cone condition is often imposed with η < 1/2. A consequence of our results is that strong convergence also follows with η < 1.
We provide another sufficient condition for the balancing condition (17) if the classical weak tangential cone condition holds.
Lemma 2.9. Let x * be the unique global minimum in B ρ (x * ) and let the weak tangential cone condition (15) hold for some η < 1. If for any sequence with ∆ n → 0 lim sup holds, then (17) is satisfied for any γ > 0.
This also illustrates that for linear problems, the balancing condition is trivially satisfied. Indeed, as the functional J(x * + ∆) is then a quadratic form (A∆, ∆), the ratio in (24) is always 1. If J can be estimated around x * from below and above by a constant times an even-homogeneous functional (similar to (19)), then (24) is satisfied.
As a justification for our claim of a unification of nonlinearity conditions, we present the implications of these conditions in the following scheme: While most of the traditional (weak) convergence proofs use the left (separated) assumptions, we employ a weaker version (right-hand side) that includes both of them as special cases. The main result about weak convergence in this paper is indicated in the last line of this table.

Weak convergence
For the following analysis, it is convenient to introduce some shorthand notations both for the noisy and exact case: The gradient iterations can then be written as The first lemma concerns monotonicity of the functional values.
Lemma 3.1. Let Assumption 2.1 hold and let x k , x k+1 ∈ B ρ (x * ) be defined by (8). Then the functional values are monotonically decreasing: Moreover, if x k ∈ B ρ (x * ) for k = 0, . . . N , then Proof. By Lipschitz continuity and Assumption 2.1, we have using Thus with (29) and (4), we obtain which proves the first assertion. A telescoping sum yields the second result.
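The monotonicity of the functional values can be observed numerically. The following sketch uses a hypothetical nonconvex functional $J(x) = L(1 - \cos x)$ with $L = 0.2$, whose gradient is Lipschitz with constant $L < 1$, and compares the summed squared gradients with the standard descent-lemma bound $(1 - L/2)\sum_k \|\nabla J(x_k)\|^2 \le J(x_0) - J(x_N)$. The constants in this bound are the textbook ones and are an assumption here; they may differ from those in the lemma.

```python
import math

L = 0.2  # Lipschitz constant of the gradient (assumed example value, < 1)


def J(x):
    """Hypothetical nonconvex, nonnegative functional with J(0) = 0."""
    return L * (1.0 - math.cos(x))


def grad_J(x):
    return L * math.sin(x)


def gradient_iteration(x0, n_iter):
    """Return the iterates x_0, ..., x_n of x_{k+1} = x_k - grad J(x_k)."""
    xs = [x0]
    for _ in range(n_iter):
        xs.append(xs[-1] - grad_J(xs[-1]))
    return xs


xs = gradient_iteration(2.0, 50)
values = [J(x) for x in xs]
grad_sq_sum = sum(grad_J(x) ** 2 for x in xs[:-1])

monotone = all(b <= a for a, b in zip(values, values[1:]))
# Standard descent-lemma bound: (1 - L/2) * sum ||grad J_k||^2 <= J(x_0) - J(x_N).
telescoped = (1.0 - L / 2.0) * grad_sq_sum <= values[0] - values[-1]
```

Both flags evaluate to true for this example, so the squared gradient norms are summable along the iteration even though the functional is not convex on the interval that is traversed.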
By exactly the same proof, replacing J by J δ and using the "noisy" variables instead of the exact ones, we can verify the analogous result for J δ .
Lemma 3.2. Let Assumption 2.1 hold and let $x^\delta_k$ be defined by (7) and let $x^\delta_k, x^\delta_{k+1} \in B_\rho(x^*)$. Moreover, assume for the Lipschitz constant of $\nabla J^\delta$ that $L_\delta < 1$. Then the corresponding residuals are monotonically decreasing: Next, we consider uniform bounds for the error for the iteration with the exact functional $J$. We recall the definition of the positive part $f_+ := \max(f, 0)$.
Proof. By (27) and with (14), we have for $k \leq N$
$$\|e_{k+1}\|^2 = \|e_k\|^2 - 2\langle \nabla J_k, e_k \rangle + \|\nabla J_k\|^2 \leq \|e_k\|^2 + 2\beta \|\nabla J_k\|^2 + \|\nabla J_k\|^2 = \|e_k\|^2 + (1 + 2\beta)\|\nabla J_k\|^2.$$
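The error recursion can be verified on a small example. The sketch below evaluates both sides of $\|e_{k+1}\|^2 = \|e_k\|^2 - 2\langle\nabla J_k, e_k\rangle + \|\nabla J_k\|^2$ along the exact iteration and checks the resulting bound with β = 0. The two-dimensional functional $J(x) = 0.2(1 - \cos x_1) + 0.1 x_2^2$ with $x^* = 0$ is a hypothetical choice for which $\langle \nabla J(x), x - x^*\rangle \ge 0$ as long as $|x_1| \le \pi$.

```python
import math


def grad_J(x):
    """Gradient of the hypothetical J(x) = 0.2*(1 - cos(x[0])) + 0.1*x[1]**2."""
    return [0.2 * math.sin(x[0]), 0.2 * x[1]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x_star = [0.0, 0.0]
x = [1.5, -1.0]
beta = 0.0  # NC(0, 0) holds here since <grad J(x), x - x*> >= 0 for |x[0]| <= pi
max_identity_gap = 0.0
bound_ok = True
for _ in range(30):
    g = grad_J(x)
    e = [xi - si for xi, si in zip(x, x_star)]
    x_next = [xi - gi for xi, gi in zip(x, g)]
    e_next = [xi - si for xi, si in zip(x_next, x_star)]
    lhs = dot(e_next, e_next)
    rhs = dot(e, e) - 2.0 * dot(g, e) + dot(g, g)
    max_identity_gap = max(max_identity_gap, abs(lhs - rhs))
    # Bound from NC(0, beta): ||e_{k+1}||^2 <= ||e_k||^2 + (1 + 2*beta) * ||grad J_k||^2.
    bound_ok = bound_ok and lhs <= dot(e, e) + (1.0 + 2.0 * beta) * dot(g, g) + 1e-12
    x = x_next
```

The identity holds up to rounding, and the bound shows that the error can grow in each step by at most a multiple of the squared gradient norm, which is summable by the monotonicity lemma.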
By telescoping we find with Lemma 3.1 This lemma gives boundedness of the exact Landweber iteration. Suppose that $x_0$ is such that
Proof. We proceed by induction. Clearly $x_0 \in B_\rho(x^*)$ by (32). Suppose that $x_l \in B_\rho(x^*)$ for all $0 \leq l \leq k$. Then Lemma 3.3 with $N = k$ yields that $\|e_{k+1}\| < \rho$, thus $x_{k+1} \in B_\rho(x^*)$. By induction it follows that $x_k \in B_\rho(x^*)$ for all $k \geq 0$.
Next, we consider the noisy iteration and verify a uniform bound for e δ k . The first lemma provides a recursive estimate.
The next lemma provides a uniform bound.
Lemma 3.6. Let Assumption 2.1 hold and let $L_\delta < 1$. Suppose that $x^\delta_k \in B_\rho(x^*)$ for $k = 0, \ldots, N$, and let NC(0, β) hold for some $\beta \in \mathbb{R}$. Define $\xi = \max\{1, 2\sqrt{\beta_+}\}$. Then
Proof. We proceed by induction over $N$. Let $N = 0$ and assume that $x_0 \in B_\rho(x^*)$. For $k = 0$ we have by (33) Since the last term is negative by definition of $\xi$, the estimate holds for $k = 0 = N$. Now suppose that if $x^\delta_k \in B_\rho(x^*)$ for $k = 0, \ldots, N - 1$, then the estimate (35) holds for $k = 0, \ldots, N - 1$. We show that this is also the case when $N$ is replaced by $N + 1$. Thus, let $x^\delta_k \in B_\rho(x^*)$ for $k = 0, \ldots, N$. By the induction hypothesis we only have to show that (35) holds for $k = N$.
By Lemma 3.2, we obtain For brevity, define $\kappa = \frac{\theta}{|1 - L_\delta|} J^\delta(x_0)$. By Lemma 3.5 and since $\xi \geq 1$, we find According to the induction hypothesis we have (35), which allows us to estimate the first three terms on the right-hand side. Moreover, by (36) and (35), again $\|e^\delta_N\|$ can be bounded by the right-hand side in (35). Thus By completing the square as before and since $(4\beta_+ - \xi^2) \leq 0$ we find (35) for $k \leq N$, which proves the lemma.
We have the following proposition: Proposition 3.7. Let Assumption 2.1 and NC(0, β) hold for some β ∈ R. Let L δ < 1 in B ρ (x * ) and x δ 0 and N ≥ 0 be such that where θ is defined in (34). Then for all k ≤ N, the sequence x δ k is in B ρ (x * ) and we have the estimate (35) for k = 0, . . . , N .
Proof. We use induction over $k \leq N$. For $k = 0$, $x^\delta_0$ is in $B_\rho(x^*)$ by (37). Let $x^\delta_l \in B_\rho(x^*)$ for $l = 0, \ldots, k$, $k < N$. We show that $x^\delta_{k+1} \in B_\rho(x^*)$. From (36) (with the sum up to the index $k - 1$) and (35) we may estimate Using (12) for the last term on the right-hand side and by Lemma 3.2 and $J^\delta(x^\delta_k) \leq J^\delta(x^\delta_0)$, we observe that $\|e^\delta_{k+1}\| \leq \rho$, thus $x^\delta_{k+1} \in B_\rho(x^*)$. Induction yields the assertion. The estimate (35) follows from Lemma 3.6.
Since it is well known that the Landweber iteration has to be stopped for noisy data, we have to introduce a stopping criterion. Here we choose a simple a-priori rule: for each noise level $\delta$ define the stopping index $N_\delta$ such that We have the following theorem.
Theorem 3.8. Let Assumption 2.1 and NC(0, β) hold for some β ∈ R. Let x 0 be close to x * such that Let δ l be a sequence of noise levels associated to noisy data via (10) and let them be sufficiently small such that holds, and let the stopping index be chosen as in (38). Then x δ l N δ l is in B ρ (x * ) and hence has a weakly convergent subsequence.
If x → ∇J(x) is weakly sequentially closed on B ρ (x * ), then a limit of this subsequence is a stationary point of J. Assume additionally that x * is the unique stationary point of J in B ρ (x * ). Then as δ l → 0.
Proof. Since $\delta_l$ is small, from (13) it follows that $\nabla J^\delta$ is Lipschitz with $L_\delta < 1$ and $\frac{1}{1 - L_\delta} \leq \frac{2}{1 - L}$ for all $\delta = \delta_l$. With (38) and (39), it may be verified that (37) holds for all $\delta_l$ and with $N = N_\delta + 1$. Thus, by Proposition 3.7, the iterates $x^\delta_k \in B_\rho(x^*)$ for all indices $k$ up to the stopping index $N_\delta + 1$. In particular, $x^{\delta_l}_{N_{\delta_l}}$ is bounded and has a weakly convergent subsequence.
From (11) we find that By Corollary 3.4, the sequence $x_k$ is in $B_\rho(x^*)$, hence by Lemma 3.1, the sequence $J_k$ is decreasing and hence convergent. Since $N_{\delta_l} \to \infty$ and $\delta_l \to 0$, we conclude by (10) that lim Then, by weak sequential closedness, $x^{\delta_l} \rightharpoonup \bar{x}$ and $\nabla J(x^{\delta_l}) \to 0$ imply $\nabla J(\bar{x}) = 0$.
Hence the limit point is a stationary point. If the stationary point in B ρ (x * ) is unique, then any subsequence has a weakly convergent subsequence with limit x * , thus x δ N δ must converge weakly to x * .
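The interplay between noise level and a-priori stopping can be sketched on a toy problem. The stopping rule $N_\delta = \lceil \delta^{-1/2} \rceil$ used below is a hypothetical stand-in and is not the rule (38); it only mimics the qualitative behaviour $N_\delta \to \infty$ and $\delta N_\delta \to 0$ as $\delta \to 0$. The diagonal operator, the noise direction, and the stepsize are likewise illustrative assumptions, and the example is finite-dimensional and hence not genuinely ill-posed.

```python
import math


def noisy_landweber_error(delta, omega=0.9):
    """Run the noisy iteration for a diagonal toy operator and stop a priori.

    The rule N = ceil(delta**-0.5) is a hypothetical stand-in: it only mimics
    the qualitative behaviour N -> infinity and delta * N -> 0 as delta -> 0.
    """
    a = [1.0, 0.5]                      # diagonal of the toy operator A
    x_true = [1.0, 1.0]
    noise = [1.0 / math.sqrt(2.0)] * 2  # unit-norm perturbation direction
    y_delta = [ai * xi + delta * ni for ai, xi, ni in zip(a, x_true, noise)]
    n_stop = math.ceil(delta ** -0.5)
    x = [0.0, 0.0]
    for _ in range(n_stop):
        # grad J^delta(x) = omega * A^T (A x - y^delta) for diagonal A
        x = [xi - omega * ai * (ai * xi - yi) for xi, ai, yi in zip(x, a, y_delta)]
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, x_true)))


errors = {delta: noisy_landweber_error(delta) for delta in (1e-2, 1e-3, 1e-4)}
```

In this sketch the reconstruction error at the stopping index decreases as the noise level decreases, in line with the convergence statement for δ → 0.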

Strong convergence
The next step in the analysis concerns a proof of strong convergence of the iterations. As could be expected, this requires additional conditions, namely that the functional is $\gamma$-balanced and satisfies NC(γ,β). $[(e_{k+1} - e_k, e_k + e_{k+1})]$ Moreover, from $(\nabla J_k, e_{k+1}) = (\nabla J_k, e_k) - \|\nabla J_k\|^2$ we obtain We split the sum into $I_1 = \{k \in [n_1, n_2] \mid \langle \nabla J_k, e_k \rangle \geq 0\}$ and $I_2 = \{k \in [n_1, n_2] \mid \langle \nabla J_k, e_k \rangle < 0\}$ and use (14) to find According to Lemmas 3.3 and 3.1, the right-hand side is uniformly bounded.
Lemma 4.2. Let Assumption 2.1 hold. Suppose that $J$ satisfies NC(γ,β) and is $\gamma$-balanced for some $\gamma \geq 0$ and for some $\beta \in \mathbb{R}$. For a subsequence $e_{k_m}$ assume that $\liminf_m \|e_{k_m}\| \geq c_0 > 0$. Then for any $0 \leq s < k_m$ and $m \geq m_0$
Proof. By (17), we find a $\tau > 0$ with Thus if $s \leq k_m$ and $m \geq m_0$, we have by Lemma 3.1 that $J(x^* - \tau e_{k_m}) \leq \gamma J(x_{k_m}) \leq \gamma J(x_s)$. We apply (14) with $x_2 = x_s$ and $x_1 = x^* - \tau e_{k_m}$. It then holds that $x_1, x_2 \in B_\rho(x^*)$. This yields Summing up, we arrive at the following theorem on convergence of the exact iteration.
Theorem 4.3. Let Assumption 2.1 hold and suppose that $J$ satisfies NC(γ,β) and is $\gamma$-balanced for some $\gamma \geq 0$ and for some $\beta \in \mathbb{R}$. Let $\|x_0 - x^*\|$ be small enough such that (32) holds. Then $x_k$ has a strongly convergent subsequence with a limit that is a stationary point. If $x^*$ is the unique stationary point of $J$ in $B_\rho(x^*)$, then the sequence $x_k$ converges to $x^*$.
Proof. By Corollary 3.4, $\|e_k\|$ is bounded for all $k$. Hence, there exists a subsequence for which $\|e_{k_m}\|$ is convergent. If $e_k$ has a strongly convergent subsequence with limit 0, then we are finished. Otherwise, for any subsequence we have $\liminf \|e_{k_m}\| \geq c_0 > 0$, in particular also for one for which $\|e_{k_m}\|$ is convergent. Take such a subsequence and write for $k_m > k_n \geq m_0$, where $m_0$ is the index in (17), By (41) and since $\|e_{k_m}\|$ is convergent, we may find for any given $\epsilon$ an $n_0 \geq m_0$ such that for all $k_m > k_n > n_0$ the right-hand side is smaller than $\epsilon$. Thus, $e_{k_n}$ is a Cauchy sequence and hence convergent. Since by Lemma 3.1, $\nabla J_{k_m} \to 0$, and $\nabla J$ is continuous, it follows that the limit must be a stationary point. If $x^*$ is the only possibility of such a limit, it follows by a standard subsequence argument that $x_k$ must converge to $x^*$.
We now come to the main result of strong convergence in the noisy case. Concerning the stopping criterion, we define for each noise level δ the stopping index N δ according to (38). Then we have the following theorem.
Theorem 4.4. Let Assumption 2.1 hold and suppose that $J$ satisfies NC(γ,β) and is $\gamma$-balanced for some $\gamma \geq 0$ and for some $\beta \in \mathbb{R}$. Let $\|x_0 - x^*\|$ be small enough such that (32) holds. Let the sequence of noise levels $\delta_l \to 0$ be sufficiently small such that (40) holds and let the stopping index be chosen as in (38).
Assume that $x^*$ is the unique stationary point of $J$ in $B_\rho(x^*)$. Then $\lim_{\delta \to 0} x^\delta_{N_\delta} = x^*$.
Proof. As in Theorem 3.8, $\|e^\delta_k\|$ for $k = 0, \ldots, N_\delta + 1$ is bounded by $\rho$ such that $x^\delta_{N_\delta}$ is always in $B_\rho(x^*)$ and hence uniformly bounded. Take a fixed $m$ and assume that $\delta$ is sufficiently small such that $N_\delta \geq m$. From (33) we may estimate recursively that (using the fact that we may take $\beta = 0$) $\theta \|\nabla J_k\|^2 + 2\theta N_\delta \delta^2 + 2\delta N_\delta \rho + 4\beta_+ \delta^2 N_\delta$, where we used (10) in the last step. The recursion for $x^\delta_k - x_k$ might be estimated by Assumption 2.1: Thus for $N_\delta \geq m$ we have Fix an $\epsilon > 0$. Since the assumptions imply that $e_m$ converges to 0 and the sum of the squares of the gradients is convergent, and by the parameter choice (38), we may find an $m$ (depending on $\epsilon$) and a $\delta_0$ such that for all $\delta \leq \delta_0$
$$\|x^\delta_{N_\delta} - x^*\|^2 \leq 2\delta^2 \frac{1}{L^2}\big((1 + L)^m - 1\big)^2 + \epsilon.$$
Taking $\delta$ even smaller (depending on $m$) yields that the right-hand side is smaller than $2\epsilon$. Thus $\lim_{\delta \to 0} x^\delta_{N_\delta} = x^*$.
As a corollary we obtain a result which cannot be proven by the approach via tangential cone conditions. $x^\delta_{N_\delta} = x^*$,

Conclusion
In this paper, we considered gradient descent iterations for functionals with Lipschitz-continuous derivative. We introduced new restrictions on the nonlinearity of the problem, namely the conditions NC(γ,β) and the γ-balancing conditions. We have shown that they are weaker than several classical nonlinearity conditions.