AN EFFICIENT PROJECTION METHOD FOR NONLINEAR INVERSE PROBLEMS WITH SPARSITY CONSTRAINTS



(Communicated by Yunmei Chen)
Abstract. In this paper, we propose a modification of the accelerated projected steepest descent method for solving nonlinear inverse problems with an $\ell^1$ constraint on the variable, which was recently proposed by Teschke and Borries (2010 Inverse Problems 26 025007). In their method, several parameters need to be estimated, which is a difficult task in many applications. We overcome this difficulty by introducing a self-adaptive strategy for choosing the parameters. Theoretically, the convergence of their algorithm was guaranteed under the assumption that the underlying mapping $F$ is twice Fréchet differentiable, together with some other conditions, while we prove weak and strong convergence of the proposed algorithm under the relatively weak condition that $F$ is Fréchet differentiable. We also report some preliminary computational results and compare our algorithm with that of Teschke and Borries; the results indicate that our method is efficient.
1. Introduction. We consider the computation of an approximate solution of the nonlinear inverse problem
(1) $F(f) = y$,
where $F : X \to Y$ is a possibly ill-posed operator between Hilbert spaces $X$ and $Y$. Since it is often the case that only noisy data $y^\delta$ with $\|y^\delta - y\| \le \delta$ are available, problem (1) has to be stabilized by regularization methods. A commonly used regularization technique is to add an $\|x\|_p$ term to the least-squares functional of equation (1), where $0 \le p < 2$. In many applications, the solution $f$ of equation (1) has a sparse series expansion $f = \sum_{\lambda\in\Lambda} x_\lambda \phi_\lambda$ with respect to a preassigned basis or frame $\{\phi_\lambda\}_{\lambda\in\Lambda} \subset X$, i.e., the series expansion of $f$ has only a very small number of nonzero coefficients $x_\lambda$. This property can be promoted by choosing $p = 0$, which, however, leads to an NP-hard problem. Since the $\ell^1$-norm is the best convex approximation of the zero-norm on the unit ball, $p = 1$ is currently the most popular choice.
In this paper, we focus our attention on the sparsity measure $\|\cdot\|_1$. That is, we consider the regularized problem
(2) $\min_x \, \|y - F(x)\|^2 + \alpha \|x\|_1$,
where $\alpha > 0$ is the parameter balancing the data-fidelity term and the $\ell^1$ term.
For linear inverse problems, several approximation/regularization methods with sparsity constraints have recently been proposed, e.g., in [2,3,4,6,7,11,13,20]. Generalizing the classical Landweber iteration for linear inverse problems (with a linear operator $A$) to solve (2) results in the recursion
(3) $x^{n+1} = \mathbb{S}_\alpha\big(x^n + \gamma A^*(y - A x^n)\big)$,
where $\gamma > 0$ is a selected parameter and $\mathbb{S}_\alpha$ is the soft-shrinkage operator, defined for a scalar $\xi$ by
$\mathbb{S}_\alpha(\xi) := \xi - \min\{\alpha, |\xi|\}\,\frac{\xi}{|\xi|}$,
with the convention $\frac{\xi}{|\xi|} = 0$ if $\xi = 0$; for a vector $x$, $\mathbb{S}_\alpha(x)$ is the vector whose $i$-th element is $\mathbb{S}_\alpha(x_i)$. This procedure, although quite simple, is known to converge slowly in general. A direct generalization of (3) to nonlinear inverse problems yields the iteration scheme
(4) $x^{n+1} = \mathbb{S}_\alpha\big(x^n + \gamma F'(x^n)^*(y - F(x^n))\big)$,
which also converges slowly, as observed by Ramlau and Teschke in [19]. One of the key factors leading to the slow convergence of (3) for linear inverse problems is that, after an initially relatively fast phase, the iterates get trapped in the overshooting of the $\ell^1$ penalty, and it takes very long to correct back. This observation inspired the authors of [5] to propose the intuitive remedy of forcing the iterates to remain within a particular $\ell^1$ ball. That is, instead of solving the model (2), they propose to solve the 'constrained' problem
(5) $\min_{\|x\|_1 \le R} \|y - A x\|^2$.
The iteration scheme (3) is correspondingly modified by replacing the soft-shrinkage operator $\mathbb{S}_\alpha$ with the operator $P_R$, the projection onto the ball $B_R := \{x : \|x\|_1 \le R\}$, resulting in the projected Landweber iteration
(6) $x^{n+1} = P_R\big(x^n + \gamma A^*(y - A x^n)\big)$,
as well as its accelerated variant with a variable step size $\beta_n$,
(7) $x^{n+1} = P_R\big(x^n + \beta_n A^*(y - A x^n)\big)$.
For a closed convex set $C$ and any $x \in X$, the projection of $x$ onto $C$ is defined as the unique point in $C$ for which the $\ell^2$ distance to $x$ is minimal.
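As a concrete illustration (a minimal numerical sketch, not the operator-theoretic setting of the paper: a matrix stands in for the linear operator $A$, and the fixed iteration count is an assumption of this example), the soft-shrinkage operator and the recursion (3) can be written as:

```python
import numpy as np

def soft_shrink(x, alpha):
    # S_alpha(x) = x - min(alpha, |x|) * x/|x|, i.e. sign(x) * max(|x| - alpha, 0)
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def thresholded_landweber(A, y, alpha, gamma, x0, n_iter=200):
    # Recursion (3): x^{n+1} = S_alpha(x^n + gamma * A^T (y - A x^n))
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = soft_shrink(x + gamma * A.T @ (y - A @ x), alpha)
    return x
```

For instance, with $A$ the $2\times 2$ identity, $y = (3, 0.1)$, $\alpha = 1$, and $\gamma = 0.5$, the iterates converge to the sparse fixed point $(1, 0)$: the small coefficient is shrunk to zero.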
Recently, by generalizing the iteration scheme (7) to the nonlinear inverse problem, Teschke and Borries [21] proposed a new iteration procedure and managed to prove the convergence of their algorithm under the assumption that the operator $F$ is twice Fréchet differentiable and three further requirements, among them the step-size criterion (B2) below, hold true. Some numerical results were also presented to illustrate the algorithm. Here and throughout the paper, $D(x)$ denotes the least-squares functional of equation (1), i.e., $D(x) := \|y - F(x)\|^2$. The details of Teschke and Borries's algorithm are as follows [21].
Algorithm 1.1 Projected steepest descent method for nonlinear inverse problems.
1: Initialization. Choose an initial guess $x^0$ and the radius $R$, set $n := 0$, and choose a parameter $0 < q < 1$.
2: while a preassigned stopping criterion is not satisfied and the maximum number of iterations is not reached do
3: Choose an initial trial value for the step size $\beta_n$.
4: Compute $x^{n+1} = P_R\big(x^n + \frac{\beta_n}{2r}\,F'(x^{n+1})^*(y - F(x^n))\big)$ by fixed-point iteration.
5: If (B2) is satisfied, increase $n$ and go to 3; otherwise set $\beta_n = q \cdot \beta_n$ and go to 4.
6: end while

Examining their algorithm, we find that the parameter $\beta_n$ plays the role of a step size during the iterations, which is crucial to the efficiency of the algorithm. From the numerical point of view, a larger step size is more desirable, and hence a larger right-hand side in (B2) of Algorithm 1.1 is needed. In this paper, we relax it by a factor $2(1-\delta)$, where $0 < \delta < 1$ is close to $0$; i.e., we replace (B2) by a correspondingly relaxed criterion. More importantly, because the trial point $x^{n+1}$ is obtained in an implicit way, a suitable initial guess of $\beta_n$ is needed. The guess in step 3 of Algorithm 1.1 may be too large, at least for most steps. In this paper, we choose the initial guess of $\beta_n$ to be the accepted value from the last iteration. To avoid this choice being too conservative, we allow it to increase when necessary. Such a 'self-adaptive' choice of the step size was adopted in [25] for solving multiple-sets split feasibility problems, where the numerical results show that this simple strategy can considerably improve the efficiency of the algorithm. Throughout our analysis, we only assume that the operator $F$ is continuously differentiable, not necessarily twice differentiable as assumed in [21]. This weaker condition for guaranteeing convergence enlarges the range of applications of the algorithm and makes it more practical.
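The step-size search in steps 3-5 can be sketched generically as follows. This is only an illustration of the backtracking pattern: the acceptance test is left abstract (a callable standing in for criterion (B2), or for our relaxed variant), since the exact criterion is not reproduced here.

```python
def backtracking_step(beta_init, q, trial, accept, max_trials=50):
    # Shrink beta by the factor q in (0, 1) until the candidate iterate
    # produced with step size beta passes the acceptance test.
    # `trial(beta)` computes the candidate x^{n+1}; `accept(beta, x_new)`
    # is a placeholder for criterion (B2) / its relaxed form.
    beta = beta_init
    for _ in range(max_trials):
        x_new = trial(beta)
        if accept(beta, x_new):
            return beta, x_new
        beta *= q
    return beta, trial(beta)
```

For example, with `beta_init = 1` and `q = 0.5`, an acceptance test that passes only for `beta <= 0.3` is first satisfied at `beta = 0.25`.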
The rest of the paper is organized as follows. In the next section, we summarize some necessary preliminary results. In Section 3, we describe the algorithm formally and prove its weak and strong convergence in Section 3.1 and Section 3.2, respectively. We report some preliminary numerical results and compare the new algorithm with that of Teschke and Borries [21] in Section 4. Finally, we conclude the paper with Section 5.

2. Preliminaries. Throughout the paper, we adopt the standard notation of [21]. For a countable index set $\Lambda$, we denote by $\ell^p(\Lambda)$ the space of sequences $x$ satisfying $\|x\|_p < \infty$. For a given Hilbert space $X$, let $\{\phi_\lambda\}_{\lambda\in\Lambda} \subset X$ be a preassigned frame; then there exist constants $0 < A \le B < \infty$ such that
$A\|f\|^2 \le \sum_{\lambda\in\Lambda} |\langle f, \phi_\lambda\rangle|^2 \le B\|f\|^2$ for all $f \in X$.
This frame allows us to consider the so-called frame operator $S : X \to \ell^2(\Lambda)$, $Sf := \{\langle f, \phi_\lambda\rangle\}_{\lambda\in\Lambda}$. Consequently, the computation of a solution $f$ is translated into finding a corresponding sequence $x \in \ell^2(\Lambda)$. Let $\tilde F := F \circ S^* : \ell^2(\Lambda) \to Y$. Then, problem (5) can be formulated in sequence space. In the rest of the paper, we still use $F$ to denote $\tilde F$, which should not introduce ambiguity, and focus our attention on the model
(15) $\min_{x \in B_R} D(x)$, where $B_R := \{x \in \ell^2(\Lambda) : \|x\|_1 \le R\}$.
Since we consider the constrained model (15), the projection operator is frequently used. For a nonempty closed convex subset $C$ of $X$ and any vector $x \in X$, $P_C(x)$ denotes the projection of $x$ onto $C$, i.e., the unique minimizer of $\|w - x\|$ over $w \in C$. We now summarize some basic properties of the projection operator.
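Since every iteration requires $P_R$, it is worth noting that in finite dimensions the Euclidean projection onto the $\ell^1$ ball $B_R$ can be computed in closed form after sorting. The following sketch implements the standard sort-based algorithm (a well-known construction, not code from [21]):

```python
import numpy as np

def project_l1_ball(x, R):
    # Euclidean projection of x onto B_R = { z : ||z||_1 <= R }.
    # A feasible x is its own projection; otherwise the projection is a
    # soft-shrinkage by a threshold tau determined from the sorted magnitudes.
    a = np.abs(x)
    if a.sum() <= R:
        return np.asarray(x, dtype=float).copy()
    u = np.sort(a)[::-1]                 # magnitudes in descending order
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - R)[0][-1]
    tau = (css[k] - R) / (k + 1.0)
    return np.sign(x) * np.maximum(a - tau, 0.0)
```

For example, the projection of $(3, 1)$ onto $B_2$ is $(2, 0)$, while any point already inside the ball is left unchanged.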
Lemma 2.1. Let $C$ be a nonempty closed convex subset of $X$. For any $u, v \in X$ and any $w \in C$, the following properties hold:
(i) $\|P_C(u) - P_C(v)\| \le \|u - v\|$;
(ii) $\langle u - P_C(u),\; w - P_C(u)\rangle \le 0$.
For the model (15), we have the following result, which provides a necessary condition for a point to be a minimizer.

Lemma 2.2. Suppose that $F$ is continuously Fréchet differentiable. If the vector $\bar x_R \in \ell^2(\Lambda)$ is a minimizer of $D(x)$ on $B_R$, then for any $\gamma > 0$,
$\bar x_R = P_R\big(\bar x_R + \gamma F'(\bar x_R)^*(y - F(\bar x_R))\big)$,
which is equivalent to
(16) $\langle F'(\bar x_R)^*(y - F(\bar x_R)),\; x - \bar x_R\rangle \le 0$ for all $x \in B_R$.

Proof. Since $B_R$ is convex, for any $x \in B_R$ the feasible direction set at $x$ is $\{d = x' - x : x' \in B_R\}$ [1]. Because $\bar x_R$ is a minimizer of $D$, no feasible direction can be a descent direction of $D$, which yields (16). On the other hand, setting $u := \bar x_R + \gamma F'(\bar x_R)^*(y - F(\bar x_R))$ and $w := \bar x_R$ in the second inequality of Lemma 2.1 yields
(17) $\langle \bar x_R + \gamma F'(\bar x_R)^*(y - F(\bar x_R)) - P_R(u),\; \bar x_R - P_R(u)\rangle \le 0$.
Taking $x := P_R(u)$ in (16), multiplying by $\gamma$, and adding the resulting inequality to (17) gives $\|\bar x_R - P_R(u)\|^2 \le 0$, hence $\bar x_R = P_R(u)$. This completes the proof.
Let
(18) $e(x, \gamma) := x - P_R\big(x + \gamma F'(x)^*(y - F(x))\big)$
denote the residual error. Then finding a minimizer of (15) amounts to finding a zero of $e(x, \gamma)$. In addition, $\|e(x, \gamma)\|$ can be viewed as an error bound measuring how far $x$ is from being a minimizer of (15), and it serves as a stopping criterion for numerical methods.
For given $x \in C$, the magnitude $\|e(x, \gamma)\|$ has the following property, which will be used in the subsequent sections.

Lemma 2.3. For any $\tilde\gamma \ge \gamma > 0$, it holds that
$\|e(x, \tilde\gamma)\| \ge \|e(x, \gamma)\|$ and $\frac{\|e(x, \tilde\gamma)\|}{\tilde\gamma} \le \frac{\|e(x, \gamma)\|}{\gamma}$.

Proof. The proof technique is similar to that in [26], which studies the monotone variational inequality problem. Setting $w := P_R\big(x + \tilde\gamma F'(x)^*(y - F(x))\big)$ and $u := x + \gamma F'(x)^*(y - F(x))$ in the second inequality of Lemma 2.1, and using the definition (18), we obtain one inequality; in a similar way, with the roles of $\gamma$ and $\tilde\gamma$ exchanged, we obtain a second one. Multiplying the first inequality by $\tilde\gamma$ and the second one by $\gamma$, and then adding the resulting inequalities yields an estimate whose second inequality follows from the Cauchy-Schwarz inequality. Rearranging terms, the assertion follows immediately.
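As a sanity check, the monotonicity behavior of the projection residual asserted above ($\|e(x,\gamma)\|$ nondecreasing in $\gamma$, $\|e(x,\gamma)\|/\gamma$ nonincreasing; cf. [26]) can be observed numerically on a toy problem. The box constraint and identity operator below are hypothetical stand-ins chosen only so that the projection and gradient are trivial:

```python
import numpy as np

def residual_norm(x, gamma, grad, project):
    # || e(x, gamma) || with e(x, gamma) = x - P(x + gamma * grad(x)),
    # where grad(x) plays the role of F'(x)^*(y - F(x)).
    return np.linalg.norm(x - project(x + gamma * grad(x)))

# Toy setup: projection onto the box [-1, 1], F = identity, y = 2, x = 0.
project = lambda z: np.clip(z, -1.0, 1.0)
grad = lambda x: np.array([2.0]) - x
x0 = np.array([0.0])

small = residual_norm(x0, 0.25, grad, project)   # gamma       = 0.25
large = residual_norm(x0, 1.0, grad, project)    # gamma_tilde = 1.0
```

One finds $\|e(x, 1)\| = 1 \ge \|e(x, 0.25)\| = 0.5$, while $\|e(x, 1)\|/1 = 1 \le \|e(x, 0.25)\|/0.25 = 2$, matching the two inequalities.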
The following assertion was proved in [21] under the conditions that $F$ is twice Fréchet differentiable and $\beta L \sqrt{D(x)} \le r/2$. In the following, we prove a similar result under the weaker conditions that $F$ is continuously Fréchet differentiable and $\beta L \sqrt{D(x)} < r$.
Lemma 2.4. Assume that $F$ is continuously Fréchet differentiable and $\beta > 0$. For any fixed $x \in B_R$, assume
(22) $\beta L \sqrt{D(x)} < r$,
and define the surrogate functional $\Phi_\beta(\cdot, x)$ as in [21]. Then there exists a unique $w \in B_R$ that minimizes the restriction of $\Phi_\beta(w, x)$ to $B_R$. We denote this minimizer by $\hat w$, and it is given by
(24) $\hat w = P_R\big(x + \tfrac{\beta}{2r}\,F'(\hat w)^*(y - F(x))\big)$.

Proof. For simplicity of notation, set $J(w) := \Phi_\beta(w, x)$. Since $F$ is Fréchet differentiable, $J$ is continuously differentiable. For any $w, w' \in B_R$ with $w \ne w'$, a direct computation shows that
$\langle \nabla J(w) - \nabla J(w'),\; w - w'\rangle > 0$,
which means that $\nabla J$ is strictly monotone; here the key estimate follows from the Cauchy-Schwarz inequality, the Lipschitz continuity of $F'$, and the assumption (22). Consequently, $J$ is strictly convex. Since $B_R$ is closed and convex and $J$ is strictly convex, $J$ has a unique minimizer on $B_R$. From the first-order optimality condition, this minimizer $\hat w$ can be characterized as
$\hat w = P_R\big(\hat w - \gamma \nabla J(\hat w)\big)$,
where $\gamma$ is an arbitrary (but fixed) positive number. Setting $\gamma := \beta/(2r)$, we obtain the desired assertion.
The unique minimizer of $\Phi_\beta$ is given in an implicit way, in the sense that both sides of the equality (24) involve $\hat w$. To find an approximation of the minimizer, a simple way, as suggested in [21], is to use the fixed-point iteration
(26) $w^{k+1} = \Psi(w^k) := P_R\big(x + \tfrac{\beta}{2r}\,F'(w^k)^*(y - F(x))\big)$.
To ensure the convergence of such a fixed-point iteration, it is sufficient that the mapping $\Psi$ be contractive. The following lemma shows this.

Lemma 2.5. Under the assumptions of Lemma 2.4, the mapping $\Psi$ in (26) is contractive, and as a consequence the iteration (26) is convergent.
Proof. For any $w, w' \in B_R$, the nonexpansiveness of $P_R$, the Lipschitz continuity of $F'$, and (22) yield
$\|\Psi(w) - \Psi(w')\| \le \tfrac{\beta}{2r}\,\|F'(w)^* - F'(w')^*\|\,\|y - F(x)\| \le \tfrac{\beta L \sqrt{D(x)}}{2r}\,\|w - w'\| < \tfrac{1}{2}\,\|w - w'\|$.
The proof is complete.
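A minimal sketch of the inner fixed-point solve for the implicit update (26); the callables `grad_step` and `project`, the stopping tolerance, and the toy numbers in the example are assumptions of this illustration, not code from [21]:

```python
import numpy as np

def implicit_update(x, beta, r, grad_step, project, tol=1e-12, max_iter=1000):
    # Iterate w <- P_R(x + (beta/(2r)) * F'(w)^*(y - F(x))) until the
    # update stalls; `grad_step(w)` returns F'(w)^*(y - F(x)) for the
    # current outer point x.  Under the contraction condition (22) this
    # converges to the unique minimizer w_hat of Lemma 2.4.
    w = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        w_new = project(x + (beta / (2.0 * r)) * grad_step(w))
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

On a toy problem with `project` the identity, `grad_step(w) = 1 - 0.1*w`, `beta = 1`, and `r = 0.5` (so $\beta/(2r) = 1$), the iteration contracts with factor $0.1$ to the fixed point $w = 1/1.1$.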
The fixed-point iteration (26) may not be an efficient scheme for finding $\hat w$, since it is a projection method with the constant step size $\beta/(2r)$. In applications, a variable step size selected in a self-adaptive manner is usually preferable; i.e., instead of (26), we suggest the iteration
$w^{l+1} = P_R\big(x + \alpha_l\,F'(w^l)^*(y - F(x))\big)$,
where $\alpha_{\max} \ge \alpha_l \ge \alpha_{\min} > 0$ is a parameter selected in a self-adaptive manner as suggested in [15,14]; interested readers are referred to these two references for details.

3. The algorithm and its convergence. In this section, we first present the new algorithm for solving (15) and then analyze its convergence.
3.1. Algorithm description and properties. In the following, we describe the algorithm in detail.

Algorithm 3.1 A self-adaptive projection method for nonlinear inverse problems.

Remark 1.
This new method is a modification of the accelerated projected steepest descent method proposed by Teschke and Borries [21]. Comparing the two methods, we find the following differences: • We require the operator $F$ to be continuously Fréchet differentiable, while Teschke and Borries [21] required $F$ to be twice Fréchet differentiable. The weaker requirement enables our algorithm to be applied in a wider range of settings. • The factor in our acceptance criterion for $\beta_n$ is $1 - \delta$, while it is $1/2$ in [21]. This slight change makes it possible to accept a larger $\beta_n$, which may improve the algorithm's efficiency. • The manner of selecting the initial value of $\beta_n$ is different. In [21], the authors select $\beta_n$ via a bound involving a supremum over $B_R$, which may be too large, leading to a large number of trials to find a suitable step size. In our algorithm, we simply select the initial guess using the information from the last iteration.
• More importantly, we adopt a 'self-adaptive' procedure in selecting the next trial value of $\beta_n$, i.e., Step S3. Such a self-adaptive strategy was first proposed by He et al. [15] for solving variational inequality problems with strongly monotone and Lipschitz continuous mappings. It was recently extended by Zhang et al. [25] to solve the multiple-sets split feasibility problem. The numerical results reported in these papers indicate that the strategy works very well. • Both algorithms need to find a fixed point of (24), which is used as the next iterate $x^{n+1}$. In [21], the authors used the constant step size $\beta/(2r)$, while in our algorithm we adopt a variable step size selected in a self-adaptive manner. This can greatly enhance the performance of the algorithm.
Since $F$ is continuously Fréchet differentiable and $r \ge 2\sup_{x\in B_R}\|F'(x)\|^2$ (see (12)), it follows from the first-order Taylor expansion of $F$ that the acceptance criterion is eventually satisfied. This ensures that the inner procedure S2 terminates in a finite number of trials. In fact, there is a number $\underline\beta > 0$ such that $\beta_n \ge \underline\beta$ for all $n \ge 0$. Moreover, since $\tau_n \ge 0$ and $\sum_{n=0}^{\infty} \tau_n < +\infty$, we conclude that $\{\beta_n\}$ is also uniformly bounded from above, say by $\bar\beta$. Hence, the whole algorithm is well-defined.
We have the following lemma for the sequence generated by the algorithm.

Proof. The boundedness of $\{x^n\}$ follows trivially from the fact that $\{x^n\} \subset B_R$. We now turn our attention to the other two assertions.
Recall that $x^{n+1}$ given by (29) is the unique minimizer of $\Phi_{\beta_{n+1}}(w, x^n)$ on $B_R$. Therefore, $\Phi_{\beta_{n+1}}(x^{n+1}, x^n) \le \Phi_{\beta_{n+1}}(x^n, x^n)$. Rearranging terms, we obtain an estimate in which the second inequality follows from (30) and the last one from $\beta_{n+1} \le \bar\beta$. Thus, $\{D(x^n)\}$ is monotonically decreasing.
Rearranging terms and summing both sides over all $n$ yields the remaining assertion. The proof is complete.
3.2. Weak convergence. In this subsection, we prove the weak convergence of the generated sequence. Since x n ∈ B R , it has at least one weak accumulation point.
Proof. Recall the fixed-point characterization of $x^{n+1}$. Since $x^n - x^{n+1} \to 0$ and $\{\beta_n\} \subseteq [\underline\beta, \bar\beta] \subset (0, \infty)$, it follows from the above inequality that the corresponding residual tends to zero. The result $x^n - x^{n+1} \to 0$ and the continuity of $F$ imply the convergence of the associated gradient terms. Since $x^*$ is a weak accumulation point of $\{x^n\}$, there is a subsequence $\{x^{n_j}\}$ of $\{x^n\}$ converging weakly to $x^*$. In particular, (37) holds for this subsequence. The rest of the proof is the same as that of Proposition 9 in [21], and we omit it here.
Proof. Denote by $\{x^{n_l}\}$ the subsequence introduced in the proof of Lemma 3.2, and let $\{\beta_{n_l}\}$ be the corresponding subsequence of step sizes. Define $u_l = x^{n_l} - x^*$, $v_l = x^{n_l+1} - x^*$, and $\beta_l := \beta_{n_l+1}$. From the conclusion of Lemma 3.1, we have $\lim_{l\to\infty} \|u_l - v_l\| = 0$. On the other hand, applying Lemma 3.2 and Lemma 2.2, and using the assumptions on $F$ together with the uniform boundedness of the $\beta_l$, we obtain an estimate controlled by (8). Similarly, from (8)-(10) we obtain a second estimate. Consequently, combining $\|u_l - v_l\| \to 0$ with the two last statements, we obtain the desired limit relation. Since the sequence $\{\beta_l\}$ is bounded, it has at least one accumulation point, which we denote by $\beta_\infty$. We choose a subsequence $\{l_j\} \subset \mathbb{N}$ such that $\beta_{l_j} \to \beta_\infty$. Note that for every $h^* + u_{l_j}$ there exists $\mu_{l_j} \ge 0$ such that $P_R(h^* + u_{l_j}) = \mu_{l_j}(h^* + u_{l_j})$; by the same reasoning, there exists $\mu \ge 0$ such that $P_R(h^*) = \mu h^*$. Without loss of generality, we can assume $\|h^*\| > R$, so $\mu < 1$.
Because $h^*$ is bounded, there exists $K > 0$ such that $\sum_{|\lambda|>K} |h^*_\lambda| \le \|h^*\|/2$, and due to the weak convergence of $\{u_{n_j}\}$, there exists $N_1 > 0$ such that for $n_j \ge N_1$, $\sum_{|\lambda|\le K} |(u_{n_j})_\lambda| \le \varepsilon$. For the same $\varepsilon$, from (39) there exists $N_2 > 0$ such that for $n_j \ge N_2$ we deduce $4\|h^*\|\varepsilon \ge |\sigma(n_j) - \mu|$, and as a consequence a bound involving the factor $\frac{1}{1-\mu}$. Then $\|u_{n_j}\| \le \sum_{|\lambda|\le K} |(u_{n_j})_\lambda| + \sum_{|\lambda|>K} |(u_{n_j})_\lambda| \le \big(1 + \tfrac{6}{1-\mu}\big)\varepsilon$, which means $\lim_{j\to\infty} u_{n_j} = 0$. This completes the proof.

Inverse Problems and Imaging Volume 10, No. 3 (2016), 689-709

4. Numerical experiments. In this section, we test the performance of the new algorithm and compare it with the method in [21] (i.e., Algorithm 1.1) to show its improvements over the latter. The test problem is a nonlinear sampling problem from [21]. This example comes from a sensing setup in which a continuous-time signal is mapped by a memoryless, invertible and nonlinear transformation, and then sampled in a non-ideal manner. In [21], the authors specified the model to reconstruct $f^*(t)$ from the transformed function $y = S \circ M \circ f^*(t)$. In this example, we try to reconstruct $f^*(t) = a_{-2}(t) - 0.5\,a_{2.5}(t)$. We set $Z_1 = \{0, 1, 2, \dots, m-1\}$ and $Z_2 = \{0, 1, 2, \dots, n-1\}$. Larger $m$ and $n$ mean a more precise description of $f^*$ and $y$, respectively, so we choose different $m, n$ to show the efficiency of our algorithm in different cases.
We also reconstruct $f^*$ from inaccurate data obtained by measurements that contain noise. We set the noisy data to $f^*(t) = a_5(t) - 0.5\,a_1(t) + e$, where $e = \mathrm{rand}(\mathrm{size}(f^*(t)))$.
The radius $R$ plays a key role in the model, as does the regularization parameter $\alpha$ in (2). However, both $R$ and $\alpha$ are quite difficult to choose in applications. We hence first settle on a value of $R$ before comparing the algorithms; the results for various $R$ are listed in Table 1. From Table 1 we can observe that for $R$ in the interval $[1.3, 1.8]$ the results are better, and we set $R = 1.6$ in the following tests. To justify this, we also give the corresponding results for $R = 1.0, 1.5, 1.8, 2.0$ in Figure 1.

Case 1. The obtained data are accurate. First, we consider the case in which the data $y$ are accurate. In Table 2, we report the running time and the number of iterations for $n$ and $m$ varying from 5 to 160. From this table we can see that in all cases, to achieve the same accuracy, our algorithm needs fewer iterations and less CPU time than Algorithm 1.1.
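The selection of $R$ via Table 1 is effectively a grid search over candidate radii; a minimal sketch of this procedure, with hypothetical `solve` and `error` callables standing in for running the algorithm and measuring reconstruction quality:

```python
import numpy as np

def select_radius(candidates, solve, error):
    # Grid search: run the solver for each candidate radius R and keep
    # the one whose reconstruction has the smallest error.
    best_R, best_err = None, np.inf
    for R in candidates:
        err = error(solve(R))
        if err < best_err:
            best_R, best_err = R, err
    return best_R
```

With candidates spaced over an interval such as $[1.0, 2.0]$, this returns the radius whose reconstruction error is smallest, mirroring how $R = 1.6$ was chosen from Table 1.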
We now illustrate and compare the two algorithms in Figure 2 by plotting β k and the residual of the problem at each iteration in Algorithm 3.1 and Algorithm 1.1.
The first plots of all subfigures in Figure 2 show that $\beta_k$ in our algorithm is almost always larger than $\beta_k$ in Algorithm 1.1. As mentioned before, $\beta_k$ can be considered as the step size in each iteration, and a larger $\beta_k$ generally means faster progress toward the solution. Indeed, our algorithm is faster than Algorithm 1.1. The second plots of all subfigures in Figure 2 show the residual of the problem at each iteration of Algorithm 3.1 and Algorithm 1.1. From Figure 2 and Table 2, it is easy to see that our algorithm attains the same result as Algorithm 1.1 and, in line with our theoretical analysis, behaves better: it needs fewer than half the iterations of Algorithm 1.1 and less CPU time to achieve the same accuracy.
In our experiments, we use an iterative method to compute $x^{n+1}$, which is expensive in every outer iteration. We present the number of inner iterations per outer iteration in Figure 3. From that figure, we can see that the average number of inner iterations of our algorithm is not very different from that of Algorithm 1.1. However, the number of outer iterations of our algorithm is smaller, which is another reason why our algorithm is faster.
In Figure 4, we show the signal reconstructed by Algorithm 3.1 and by Algorithm 1.1 in comparison with the signal we obtained. We find that the signal reconstructed by our algorithm is similar to the signal reconstructed by Algorithm 1.1.
One major relaxation in Algorithm 3.1 is the use of a new step-size rule that makes a larger step size more likely to be accepted, which can reduce the number of iterations of the algorithm. However, the larger the initial guess of the step size, the slower the convergence of the 'inner' fixed-point iteration. Nevertheless, since the inner fixed-point iteration of Algorithm 3.1 is also carried out in a self-adaptive manner, the number of inner iterations of Algorithm 3.1 is usually no larger than that of the method with a constant step size, as shown in Figure 3.
Case 2. The obtained data are inaccurate. We now illustrate the case in which the observed data $y$ contain noise. Figure 5 shows the value of $\beta_k$ and the residual of the problem at each iteration of Algorithm 3.1 and Algorithm 1.1 when the data to be reconstructed contain noise. Figure 5 also shows that $\beta_k$ in our algorithm is almost always larger than $\beta_k$ in Algorithm 1.1, which indicates that our algorithm is faster than Algorithm 1.1.
From Table 3 and Figure 5, it is easy to see that our algorithm is still faster than Algorithm 1.1 when the same stopping rule is used in both algorithms.
We also present the number of inner iterations for computing $x^{n+1}$ in each outer iteration in Figure 6. From this figure, we can see that the average number of inner iterations is not very different from that of Algorithm 1.1. Since the number of outer iterations of our algorithm is smaller, this is another reason why our algorithm is faster. In Figure 7, we show the signal reconstructed by Algorithm 3.1 and by Algorithm 1.1 in comparison with the observed signal, with and without noise. We find that the signal reconstructed by our algorithm is better than the signal reconstructed by Algorithm 1.1.

5. Conclusions. In this paper, we proposed an efficient projection method for solving nonlinear inverse problems with a sparsity constraint. Under the condition that the operator $F$ is continuously Fréchet differentiable, a weaker condition than that in Teschke and Borries [21], which required $F$ to be twice Fréchet differentiable, we proved the convergence of the algorithm. This weaker requirement enables the algorithm to be applied in a wider range of settings. Moreover, the factor in our acceptance criterion for $\beta_n$ is $1 - \delta$, while it is $1/2$ in [21]; this slight change makes it possible to accept a larger $\beta_n$, which may improve the algorithm's efficiency. From the numerical point of view, we select the initial guess using the information from the last iteration and, more importantly, we adopt a 'self-adaptive' procedure in selecting the next trial value of $\beta_n$, i.e., Step S3 of the algorithm. In the numerical experiments, we tested the algorithm not only with exact observed data but also with noisy data. Our results show that both our algorithm and that of Teschke and Borries [21] are quite robust to noise.