AN INNER-OUTER REGULARIZING METHOD FOR ILL-POSED PROBLEMS

Abstract. The Conjugate Gradient method is widely used as a regularizing technique for solving linear systems whose coefficient matrix is ill-conditioned and whose right-hand side vector is perturbed by noise. It enjoys a good convergence rate and quickly computes an iterate, say x_{k_opt}, which minimizes the error with respect to the exact solution. This behavior can be a disadvantage in the regularization context, because the high-frequency components of the noise also enter the computed solution quickly, making the detection of k_opt difficult and causing a sharp increase of the error after the k_opt-th iteration. In this paper we propose an inner-outer algorithm based on a sequence of restarted Conjugate Gradients, with the aim of overcoming this drawback. A numerical experimentation validates the effectiveness of the proposed algorithm.

1. Introduction. Given a matrix A ∈ R^{n×n} and a vector b ∈ R^n, we consider the system

(1) Ax = b.

In many applications the available right-hand side of the system is contaminated by a noise η, i.e. b = b* + η. The vectors b* and x*, such that Ax* = b*, are considered the exact right-hand side and the exact solution of the system. Classical examples are problems arising from the discretization of Fredholm integral equations of the first kind [7], as for instance in image deconvolution. We assume A to be a large ill-conditioned matrix, with singular values gradually decaying to zero. Because of the presence of η, here assumed to be an uncorrelated Gaussian white noise with distribution N(0, σ²I), the solution x = A†b is often a poor approximation of x* even if the magnitude of η is small. Better approximations to x* than x can be found by looking for regularized solutions of (1). Special regularization techniques have been devised to deal with this kind of problem, exploiting both direct methods (such as Tikhonov's method) and iterative methods [1,7]. The iterative methods, which are suggested for large matrices A without particular structure properties, must enjoy the semi-convergence property: in the presence of noise they must first reconstruct the low-frequency components of the solution, in such a way that the iteration can be stopped before the high-frequency components of the noise start to enter the computed solution. Among the classical semi-convergent methods a major role is played by the conjugate gradient method (CG), whose regularizing properties are well known (see for example [6,7,11]). When A is not a symmetric positive definite matrix, CG can be applied to the normal equations

(2) A^T A x = A^T b.

The optimal number of iterations of CG is very sensitive to the perturbation of the right-hand side [2], and often the noise enters very quickly in the iterates following the optimal one.
As a consequence, the regularizing efficiency of CG depends heavily on the effectiveness of the stopping rule employed. The discrepancy principle, widely used as a stopping rule, does not produce sufficiently accurate estimates of the stopping index. The use of the Generalized Cross Validation rule (GCV) [5,12,14] gives better results, even if in some cases it is not fully reliable.
In this paper we propose an inner-outer algorithm based on both Tikhonov's method and CG, with the aim of overcoming the problems related to the detection of the best regularizing Tikhonov parameter and to the too fast entry of the noise after the optimal CG iteration. The algorithm consists of two loops: in the outer loop a decreasing sequence of regularization parameters λ_h, h ≥ 0, is generated. The aim of the h-th outer step is to compute a regularizing solution of the system

(3) (A^T A + λ I) x = A^T b,  with λ = λ_h,

which minimizes the norm of the error with respect to x*. Namely, at the h-th outer step a restarted CG is applied to system (3), starting with the last vector computed at the previous outer step, until a stopping rule based on GCV is satisfied. The choice of the sequence λ_h may be critical. We suggest a geometrically decreasing sequence, suitably stopped: more precisely, the outer loop is stopped when the results of two subsequent outer steps are equal. The proposed approach gains in reliability at the price of a computational cost greater than that of CG applied directly to (2). A similar approach has been proposed in [15] for estimating the Tikhonov parameter. Indeed, the method of [15] consists of a restarted CG applied to a sequence of Tikhonov equations; the main differences reside in the stopping criteria for both the inner and the outer loop. Namely, in [15] the inner loop goes on until convergence is achieved and the outer loop is stopped by using the damped Morozov discrepancy. Also in [17] an algorithm with a similar structure has been described, which produces a sequence of pairs of approximate solutions and regularization parameters with the aim of minimizing the value of the Tikhonov functional instead of the norm of the error with respect to x*. Other approaches that adaptively generate a sequence of Tikhonov regularization parameters have been studied in the literature.
In [10,16,9] (see also the references therein) improved versions of the model function method, based on the damped Morozov discrepancy, have been introduced. Unlike the methods based on the Morozov discrepancy, our algorithm does not require knowledge of the noise level.
The outline of the paper is the following: first, in Section 2 we compare the behaviors of Tikhonov's method and CG. In Sections 3 and 4 we recall the properties of CG as a regularization method and of GCV as a stopping rule for CG. In Section 5 the proposed inner-outer method is described and a sketch of its code is given. Finally, in Section 6 the results of an experimentation on the collection of test problems by Hansen [8] are presented and analyzed.
Notation. Throughout the paper, ‖v‖ denotes the Euclidean norm of a vector v, i.e. ‖v‖² = v^T v.

2. Regularization. Let A = UΣV^T be the singular value decomposition of A, where U = [u_1, ..., u_n] ∈ R^{n×n} and V = [v_1, ..., v_n] ∈ R^{n×n} have orthonormal columns, and Σ = diag(σ_1, ..., σ_n), where the σ_i, i = 1, ..., n, are the singular values of A. We have assumed that the σ_i gradually decay toward zero; in practice this means that they settle to values of the same order of magnitude as the machine precision. The condition number of A, given by κ = σ_1/σ_n, is very large.
We consider the expansions of b* and η in the basis U and the expansions of x* and x in the basis V:

(4) b* = Σ_{i=1}^n (u_i^T b*) u_i,  η = Σ_{i=1}^n η_i u_i,  x* = Σ_{i=1}^n (u_i^T b*/σ_i) v_i,  x = Σ_{i=1}^n ((u_i^T b* + η_i)/σ_i) v_i,

where η_i = u_i^T η. The vectors v_i are associated with frequencies which increase with i. Typically, the low-frequency components describe the overall features of x* and the high-frequency components represent its details. The high-frequency components of x can be dominated by the high-frequency components of the noise. An acceptable approximation can be obtained only if the quantities |u_i^T b*| decay to zero with i faster than the corresponding σ_i (this condition is known as the discrete Picard condition), at least until the numerical levelling of the singular values.
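The discrete Picard condition can be checked numerically through the SVD. The following sketch (our own construction, not taken from the paper) builds a synthetic ill-conditioned problem whose coefficients |u_i^T b*| decay faster than the σ_i, and computes their ratio:

```python
import numpy as np

# Illustration of the discrete Picard condition on a synthetic problem:
# the SVD coefficients |u_i^T b*| must decay faster than the singular
# values sigma_i for regularization to make sense.
rng = np.random.default_rng(0)
n = 64
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.7 ** np.arange(n)                 # singular values decaying to zero
A = U @ np.diag(s) @ V.T
x_star = V @ (0.9 ** np.arange(n))      # smooth exact solution
b_star = A @ x_star                     # exact right-hand side

Us, sigma, Vt = np.linalg.svd(A)
coeff = np.abs(Us.T @ b_star)           # |u_i^T b*| = sigma_i * 0.9^i here
ratio = coeff / sigma                   # decays like 0.9^i: Picard holds
```

By construction the ratio |u_i^T b*|/σ_i decays like 0.9^i, so the condition holds until the numerical levelling of the σ_i.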
The prototype of direct regularization methods is Tikhonov's method, which solves a penalized least-squares problem of the form

(5) min_x φ(x, λ),  φ(x, λ) = ‖Ax − b‖² + λ ‖L^γ x‖²,

where λ, called the regularization parameter, is a suitable positive scalar, L is a well-conditioned matrix often related to the discretization of a differential operator, and γ is related to the smoothness level of the solution. The assumption that N(A) ∩ N(L) = {0} guarantees that (5) has a unique solution. To avoid an excessive computational effort, the structure of L should be fairly simple, for example banded with a small bandwidth, or even diagonal. If L = I, a solution of (5) having minimal norm is sought and the problem is said to be in standard form. We consider here the problem in standard form with γ = 0, which is equivalent to solving the normal equations (3). Let x^(λ) be the solution of (3). Tikhonov's method aims at detecting the value λ_T which minimizes the error E^(λ) = ‖x^(λ) − x*‖. The solution x^(λ) is given by

(6) x^(λ) = Σ_{i=1}^n ϕ_i^(λ) (u_i^T b / σ_i) v_i,  ϕ_i^(λ) = σ_i² / (σ_i² + λ).

The factors ϕ_i^(λ) are the Tikhonov filter factors. The condition number of system (3) is κ^(λ) = (σ_1² + λ)/(σ_n² + λ), to be compared with the condition number κ² of system (2). The value λ_T is linked to the singular values of A and to the magnitude ‖η‖ of the noise. This fact can be easily seen by assuming the coefficients η_i to be all of the same order, i.e. |η_i| ∼ ‖η‖/√n for i = 1, ..., n: in this case, from (4) and (6) it follows that λ_T is close to σ_k² for some index k depending on the noise level. Of course, finding λ_T in the case of a large matrix A is not feasible. Then, to detect a sufficiently good approximation of λ_T, one should solve (3) for many values of λ and have a reliable estimator of the error of x^(λ) with respect to the unknown x*.
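The filtered expansion (6) can be coded directly. The short sketch below (our own code, with illustrative names) computes the Tikhonov solution in standard form through the SVD and checks it against a direct solve of the normal equations (3):

```python
import numpy as np

# Tikhonov regularization in standard form via the SVD, using the
# filter factors phi_i = sigma_i^2 / (sigma_i^2 + lambda) of (6).
def tikhonov_svd(A, b, lam):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    phi = s**2 / (s**2 + lam)           # Tikhonov filter factors
    return Vt.T @ (phi * (U.T @ b) / s)

# Sanity check against the normal equations (3): (A^T A + lam I) x = A^T b
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
b = rng.standard_normal(30)
x_svd = tikhonov_svd(A, b, 0.5)
x_ne = np.linalg.solve(A.T @ A + 0.5 * np.eye(20), A.T @ b)
```

The two solutions agree because ϕ_i^(λ)/σ_i = σ_i/(σ_i² + λ), which is exactly the i-th diagonal entry of (Σ² + λI)^{-1}Σ.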
A different approach consists in applying to system (2) an iterative semi-convergent method which computes a sequence x_k, k = 1, 2, ..., stopping the iteration before the high-frequency components of the noise start to enter the computed solution. In this sense the iteration index k plays the role of a regularization parameter. We consider here the conjugate gradient method (CG) because of its good convergence rate. Let k_opt be the optimal index, i.e. the index satisfying

E_{k_opt} = min_k ‖x* − x_k‖.

CG converges quickly, but this behavior can be a disadvantage in the regularization context, because the high-frequency components of the noise also enter the computed solution quickly, making the detection of k_opt difficult and causing a sharp increase of the error after the k_opt-th iteration.
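Semi-convergence is easy to observe on a toy problem. The sketch below (our own construction and naming) runs plain CG on the normal equations (2) for a severely ill-conditioned matrix with noisy data and records the error history, which first decreases and then grows sharply:

```python
import numpy as np

# Toy illustration of semi-convergence of CG on A^T A x = A^T b.
rng = np.random.default_rng(5)
n = 32
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.5 ** np.arange(n)                     # severely ill-conditioned
A = U @ np.diag(s) @ V.T
x_star = V @ (0.8 ** np.arange(n))
eta = rng.standard_normal(n)
b = A @ x_star
b = b + 0.01 * np.linalg.norm(b) * eta / np.linalg.norm(eta)  # 1% noise

x = np.zeros(n)
q = A.T @ b                                  # residual of the normal equations
p = q.copy()
errors = [np.linalg.norm(x - x_star)]
for _ in range(40):
    if q @ q < 1e-30:
        break
    Mp = A.T @ (A @ p)
    alpha = (q @ q) / (p @ Mp)
    x = x + alpha * p
    q_new = q - alpha * Mp
    p = q_new + ((q_new @ q_new) / (q @ q)) * p
    q = q_new
    errors.append(np.linalg.norm(x - x_star))
k_opt = int(np.argmin(errors))               # unknown in practice
```

The minimum of `errors` is attained at an intermediate iteration, after which the amplified noise dominates and the error grows by orders of magnitude.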
At this point a natural question arises: what happens if CG is applied to system (3) with λ varying over the set of nonnegative real numbers? Let x_k^(λ) be the sequence computed by CG applied to (3) for a selected value of λ and let k_opt^(λ) be the corresponding optimal index. For λ = 0 the sequence x_k^(0) is the one computed by CG for system (2), hence k_opt^(0) = k_opt. To see what happens, we consider a problem randomly generated following the suggestions given in [7], Section 4.7. The matrix A has size n = 64 and condition number κ = 10^20, with singular values decaying as σ_i ≈ 0.7^i. A white Gaussian noise η of magnitude 0.01 is added to the right-hand side b*. We apply Tikhonov's method to this problem with λ varying in the interval [0.001, 0.04] and for each λ we compute the error E_T^(λ) (black points in Fig. 1, left). Then we apply CG to system (3) starting with x_0 = 0 and compute the errors E_C^(λ) = min_k ‖x* − x_k^(λ)‖ (also shown in Fig. 1, left).
Both error functions decrease until their minimum points λ_T and λ_C, then increase. The two minimum points differ, and λ_C < λ_T (for an explanation of this fact, see Section 3). As expected, the best error E_C^(λ_C) is smaller than E_T^(λ_T), meaning that a better result could be obtained by applying CG to (3) if the correct value λ_C were known and a good stopping rule were available. This fact appears interesting, but we think that an equally important issue concerns the error behavior of CG applied to (3) as λ varies. The better conditioning of A^T A + λI with respect to A^T A slows down the increase of the error after the k_opt^(λ)-th iteration. Fig. 1 (on the right) shows the error histories for λ = 0 and for λ ≠ 0. The different behavior of the errors after the minimum is evident: in the first case the error increases sharply, in the second case the error remains nearly constant for a long while. This advantageous behavior is obtained at the expense of a moderate slowing down of the convergence rate.
3. CG applied to system (3). We first examine how CG applies to solving (3) with different values of the parameter λ > 0. To avoid too cumbersome a notation, we omit the explicit indication of λ in the vectors computed by CG. Denoting M^(λ) = A^T A + λI and g = A^T b, system (3) becomes

(7) M^(λ) x = g.

Let x_0 be a starting point and q_0 = g − M^(λ) x_0 the corresponding residual of system (7). CG is a projection method on the Krylov subspace

K_k^(λ) = span{q_0, M^(λ) q_0, ..., (M^(λ))^{k−1} q_0}.

It builds a sequence of vectors x_k ∈ x_0 + K_k^(λ), for k = 1, 2, ..., such that x_k − x_{k−1} ∈ K_k^(λ), and the residual q_k = g − M^(λ) x_k is orthogonal to K_k^(λ), i.e. ⟨q_k, w⟩ = 0 for any w ∈ K_k^(λ). In detail, setting p_0 = q_0, at the k-th step the method computes

function CGstep(λ, k)
    α_k = ‖q_k‖² / (p_k^T M^(λ) p_k);
    x_{k+1} = x_k + α_k p_k;
    q_{k+1} = q_k − α_k M^(λ) p_k;
    β_k = ‖q_{k+1}‖² / ‖q_k‖²;
    p_{k+1} = q_{k+1} + β_k p_k;

The regular behavior of CG should be monitored, checking for example for breakdowns and evident instability. For the sake of simplicity these controls are not included in the code shown above.
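The CGstep recurrences can be collected into a short matrix-free routine. The following is a standard CG sketch in our own notation (not the paper's code), without the breakdown checks mentioned above:

```python
import numpy as np

# CG for system (7): M(lambda) x = g, with M(lambda) = A^T A + lambda*I
# and g = A^T b, applied matrix-free (A^T A is never formed).
def cg_regularized(A, b, lam, x0=None, k_max=200, tol=1e-12):
    n = A.shape[1]
    x = np.zeros(n) if x0 is None else x0.copy()
    g = A.T @ b
    mv = lambda v: A.T @ (A @ v) + lam * v   # matrix-free M(lambda)
    q = g - mv(x)                            # residual q_0
    p = q.copy()
    for _ in range(k_max):
        qq = q @ q
        if np.sqrt(qq) < tol:
            break
        Mp = mv(p)
        alpha = qq / (p @ Mp)                # alpha_k
        x = x + alpha * p                    # x_{k+1}
        q = q - alpha * Mp                   # q_{k+1}
        p = q + ((q @ q) / qq) * p           # beta_k and p_{k+1}
    return x

# Agreement with a direct solve of (A^T A + lam I) x = A^T b
rng = np.random.default_rng(2)
A = rng.standard_normal((25, 15))
b = rng.standard_normal(25)
x_cg = cg_regularized(A, b, 0.1)
x_ref = np.linalg.solve(A.T @ A + 0.1 * np.eye(15), A.T @ b)
```

Keeping M^(λ) matrix-free costs two products with A (one with A and one with A^T) per iteration, which is the cost model used throughout the paper.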
An orthonormal basis W_k = [w_1, ..., w_k] of K_k^(λ) can be constructed by applying k steps of the symmetric Lanczos algorithm to the vector q_0. The tridiagonal matrix T_k^(λ) = W_k^T M^(λ) W_k, with diagonal entries γ_j and off-diagonal entries δ_j, can be viewed as the representation of M^(λ) projected on K_k^(λ). Its entries are γ_1 = 1/α_0 and

(8) γ_{j+1} = 1/α_j + β_{j−1}/α_{j−1},  δ_{j+1} = √β_{j−1} / α_{j−1},  j = 1, 2, ...

If x_0 is chosen as the null vector, then q_0 = g does not depend on λ and, since M^(λ) = A^T A + λI, the Krylov subspace

(9) K_k = span{g, A^T A g, ..., (A^T A)^{k−1} g}

does not depend on λ either, so we can drop the superscript. This remains true also if x_0 is not the null vector, as long as it belongs to K_h for some h < k. On the contrary, the matrix T_k^(λ) depends on λ and it is easy to see that

(10) T_k^(λ) = T_k^(0) + λ I.

The eigenvalues θ_{k,j}^(λ), j = 1, ..., k, of T_k^(λ) are the Ritz values of M^(λ) at the k-th iteration. From (10) it follows that θ_{k,j}^(λ) = θ_{k,j} + λ, where the θ_{k,j} are the Ritz values for λ = 0. The regularization properties of CG applied to (7), whose solution is x^(λ), can be derived by using the expression of its filter factors given in [7], p. 146: if x_0 = 0,

(11) ψ_{k,i}^(λ) = 1 − ∏_{j=1}^{k} (θ_{k,j}^(λ) − σ_i² − λ) / θ_{k,j}^(λ).

Then

(12) x_k^(λ) = Σ_{i=1}^{n} ψ_{k,i}^(λ) ϕ_i^(λ) (u_i^T b / σ_i) v_i,

with ϕ_i^(λ) the Tikhonov filter factors of (6). To obtain the same regularizing effect we would get by applying Tikhonov's method with the optimal parameter λ_T, a value of λ in (12) should verify ψ_{k,i}^(λ) ϕ_i^(λ) ≈ ϕ_i^(λ_T); since the factors ψ_{k,i}^(λ) do not exceed 1, it follows that the optimal value λ_C for (12) must verify λ_C < λ_T. The combined effect of the CG filter factors ψ_{k,i}^(λ) and of the Tikhonov filter factors entails a slowing down in the growth of the filter factors of CG applied to system (7), allowing more iterations before the noise sets in and, more importantly from our point of view, preventing the filter factors corresponding to the smallest σ_i from growing too much. Hence more low-frequency components can be reconstructed (see Fig. 1, left) and the error growth after the optimal iteration is reduced. The different error histories for λ = 0 and λ ≠ 0 in Fig. 1 (right) point out this behavior.
Remark. From the sequence of parameters α_k and β_k computed by function CGstep, the entries γ_k and δ_k of the matrix T_k^(λ) can be recovered through (8).

4. GCV as a stopping rule. The choice of the index at which the iteration should be stopped is critical. Ideally, the iteration should be stopped when the error ‖x* − x_k‖ is minimum, but this error is not directly computable. An acceptable approximation of the optimal index could be sought through the predictive error π_k = ‖Ax* − Ax_k‖, but this error is not directly computable either. However, it is possible to get a reliable estimate of π_k stochastically by using the Generalized Cross Validation method (GCV) described in [5,14]. Let x_k = R_k(b) be the vector computed at the k-th step by an iterative method. The operator A_k(b) = A R_k(b) is called the influence operator. For certain iterative regularization methods (for example the Landweber method) the operator A_k depends linearly on b and an explicit simple form is available, but this is not the case for CG. To overcome this difficulty, if the function R_k(b) is continuously differentiable in a neighborhood of b (and this is true for CG applied to (7)), in [12] it is suggested to replace R_k(b) by its first-order approximation J_k b, where J_k is the Jacobian matrix of x_k with respect to b. With this approximation Ax_k ∼ A_k b, where A_k = A J_k is the approximate influence matrix. The GCV functional for CG is defined as

(13) V_k = n ‖r_k‖² / (tr(I − A_k))²,

where r_k = b − Ax_k is the residual of x_k in system (1) and tr(B) indicates the trace of the matrix B. If initially x_0 = 0, we have V_0 = n ‖r_0‖² / tr(I)² = ‖b‖²/n. The minimizer k_V of V_k is the GCV estimate of the optimal index. The quantity tr(I − A_k) in the denominator of (13) is regarded as the effective number of degrees of freedom.
To compute V_k, a technique for approximating the trace of A_k is required. For large dimensions, the Trace Lemma [13] is generally exploited: if v is a vector of n independent samples of a normal random variable with zero mean and unit variance, then v^T A_k v is an unbiased estimator of tr(A_k), i.e.

(14) tr(A_k) = E(v^T A_k v).

For a practical implementation of (14), we need:
(a) one or more vectors v ∼ N(0, I) [13]. According to [4], in most of the current problems, when the matrix A_k is symmetric and n is large enough, only one vector v is needed to get a reliable estimate. From now on, by v we always indicate a vector of this type.
(b) a way to compute the vector s_k = J_k v, so that v^T A_k v = (A^T v)^T s_k can be evaluated. The (j, h)-th component of J_k is given by ∂(x_k)_j / ∂b_h and, following [3], the product J_k v can be computed recursively, by differentiating with respect to b, in the direction v, the recursive definition of x_k given in function CGstep; the required auxiliary vectors are then updated at the k-th step together with the CG vectors. Initially, if x_0 = J_0 b, then s_0 = J_0 v and u_0 = t_0 = A^T v − M^(λ) s_0. This procedure for evaluating tr(A_k) doubles the computational cost of CG in terms of the number of matrix-vector products.
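The Trace Lemma (14) is easy to demonstrate numerically. The sketch below (our own code; the matrix B and all names are illustrative) averages a few Gaussian probes on a matrix with known trace:

```python
import numpy as np

# Stochastic trace estimation as in (14): for v with i.i.d. N(0,1)
# entries, E[v^T B v] = tr(B).  Only matrix-vector products are needed.
def stochastic_trace(matvec, n, n_probes, rng):
    est = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(n)
        est += v @ matvec(v)
    return est / n_probes

rng = np.random.default_rng(3)
n = 200
B = np.diag(np.linspace(1.0, 2.0, n))      # known trace: tr(B) = 300
trace_est = stochastic_trace(lambda v: B @ v, n, 400, rng)
```

In the GCV setting the matrix-vector product v ↦ A_k v is never formed explicitly: it is obtained from the derivative vector s_k = J_k v as (A^T v)^T s_k, which is why item (b) above is needed.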

5. The inner-outer algorithm. In Section 2 we have seen that applying CG to system (3) can produce better results than both CG applied to system (2) and Tikhonov's method. However, this is true only for a limited range of values of the parameter λ. To avoid such a critical issue, we propose the following ACG algorithm:
(1) initialize λ_0, x_0^(λ_0) and δ;
(2) set h = 0;
(3) while an outer stopping condition is not satisfied do: apply a restarted CG to system (3) with λ = λ_h, starting from the last computed vector, until an inner stopping rule based on GCV is satisfied; set λ_{h+1} = δ λ_h and increase h.
Here are some details about steps (1) and (3) of ACG.
Step (1). We have already noted in Section 2 that the optimal value λ_T for Tikhonov's method is not far from a singular value σ_k² of A^T A, so we assume that the optimal value of λ when CG is applied to (3) should also be close to a singular value of A^T A. This assumption suggests considering an exponentially decreasing sequence of the form λ_h = λ_0 δ^h, with suitable λ_0 and δ < 1. At the beginning we take λ_0 = γ_1 = ‖Ag‖²/‖g‖², which is the first value of the sequence given in (8) when CG is applied to system (2) and represents an approximation of the first eigenvalue of A^T A. Concerning the value of δ, which rules the decrease of the parameter λ, we suggest δ = 0.1 since, in our experiments, it represents a good trade-off between the regularization capability and the convergence rate of the algorithm. Other choices of λ_0 and δ, driven by information about the spectral structure of the matrix A^T A and the magnitude ‖η‖ of the noise, could be more suitable. In particular, an idea of the decay rate of the spectrum of A^T A can be acquired according to the Remark at the end of Section 3.
Step (3). To find x_{k_h}^(λ_h) we apply a restarted CG method starting from x_{k_{h−1}}^(λ_{h−1}). The CG iteration is stopped through a stopping rule (implemented by means of the Boolean inner_stop_cond in the following code) based on GCV, i.e. it is stopped when the GCV functional increases, detecting in this way a regularized solution of system (3) for λ = λ_h. It is true that the regularized solutions x_{k_h}^(λ_h), for h = 0, 1, ..., correspond to different values of the parameter λ, but as noted in Section 3 they all belong to the same Krylov space K_k defined in (9). For this reason, the sequence obtained by joining the sequences of vectors computed in subsequent steps of the while loop can be seen as produced by a unique iterative projection method. It follows that also the sequences of the GCV functionals V_k and of the derivatives s_k can be linked together. The outer stopping condition is satisfied for the first h such that k_h = 0, meaning that two subsequent steps of the while loop produce the same vector, i.e. x_{k_h}^(λ_h) = x_{k_{h−1}}^(λ_{h−1}). This condition is monitored in the following code by the Boolean outer_stop_cond. The number of iterations is kept under control by means of the condition h < h_max, where h_max is a given constant.
To avoid long lists of arguments, all the constructed vectors are assumed to belong to a global working space.
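The inner-outer structure can also be condensed into a short, runnable Python sketch (our own simplified code). For brevity the GCV-based inner stopping rule is replaced here by a fixed number of inner CG steps, so this is only a structural illustration of ACG, not the actual algorithm:

```python
import numpy as np

# Structural sketch of ACG: outer loop over lambda_h = lambda_0 * delta^h,
# inner loop of restarted, warm-started CG on (A^T A + lam I) x = A^T b.
# The real inner rule (stop when the GCV functional V_k increases) is
# replaced by a fixed iteration count.
def acg_sketch(A, b, delta=0.1, h_max=8, inner_steps=10):
    n = A.shape[1]
    g = A.T @ b
    lam = np.linalg.norm(A @ g) ** 2 / np.linalg.norm(g) ** 2  # lambda_0 = gamma_1
    x = np.zeros(n)
    for h in range(h_max):
        x_prev = x.copy()
        q = g - (A.T @ (A @ x) + lam * x)     # warm-started restart
        p = q.copy()
        for _ in range(inner_steps):
            qq = q @ q
            if qq == 0.0:
                break
            Mp = A.T @ (A @ p) + lam * p
            alpha = qq / (p @ Mp)
            x = x + alpha * p
            q = q - alpha * Mp
            p = q + ((q @ q) / qq) * p
        if np.linalg.norm(x - x_prev) <= 1e-12 * (1.0 + np.linalg.norm(x)):
            break                              # two outer steps gave the same vector
        lam *= delta                           # lambda_{h+1} = delta * lambda_h
    return x

# Demo on a synthetic ill-posed problem (our construction)
rng = np.random.default_rng(4)
n = 32
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.5 ** np.arange(n)
A = U @ np.diag(s) @ V.T
x_star = V @ (0.8 ** np.arange(n))
b_star = A @ x_star
eta = rng.standard_normal(n)
b = b_star + 0.01 * np.linalg.norm(b_star) * eta / np.linalg.norm(eta)

x_acg = acg_sketch(A, b)
x_naive = V @ ((U.T @ b) / s)                  # unregularized solution A^{-1} b
err_acg = np.linalg.norm(x_acg - x_star)
err_naive = np.linalg.norm(x_naive - x_star)
```

On this toy problem the sketch recovers a usable approximation of x*, while the unregularized solution is destroyed by the amplified noise.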
A sketch of the code of ACG follows.

λ_0 = ‖Ag‖² / ‖g‖²;   ( approximation of σ_1² )
δ = 0.1; h = 0; outer_stop_cond = true;
while outer_stop_cond and h < h_max
    k = 0;
    while not inner_stop_cond
        CGstep(λ_h, k); k = k + 1;
    k_h = k;
    if k_h = 0 then outer_stop_cond = false;
    λ_{h+1} = δ λ_h; h = h + 1;

Note that a decreasing sequence of parameters λ_h has been considered also in the algorithm Descent-TCG of [17], together with a specific stopping rule, with the aim of minimizing the functional φ(x, λ) defined in (5). Both the parameter updating and the stopping rule of [17] are not suitable for our algorithm, which aims at minimizing ‖x* − x_k‖.

6. Numerical experiments. The numerical experimentation has been carried out in Mathematica with machine-epsilon 2^−53 arithmetic on 11 problems of [8], with sizes n = 64 and n = 256. These problems are obtained from the discretization of Fredholm integral equations of the first kind. For each problem the matrix A and the solution x* were given; then the vector b* = Ax* was computed and six white Gaussian noises were generated, with relative magnitudes NSR = ‖η‖/‖b*‖ ranging from 0.001 to 0.05. To each sample, i.e. a triple (A, b*, η), both the simple CG with GCV stopping rule (CG+GCV in the following) and ACG have been applied. The solutions obtained by CG+GCV and by ACG are denoted x_CG+ and x_ACG respectively. The theoretical optimal solution x_{k_opt} has also been computed with CG as a reference. The corresponding errors with respect to the exact solution are denoted E_CG+, E_ACG and E_{k_opt}. In order to highlight the behavior of ACG, two problems of size n = 64 have been selected, namely shaw with NSR = 0.01 and ilaplace with the solution x* corresponding to the fourth example and NSR = 0.001. In Figure 2 we show the error histories of CG (solid and dotted line) and ACG (dashed line) applied to problems shaw (left) and ilaplace (right). The transition point between solid line and dots indicates where GCV stops the iteration of CG. For both problems, x_ACG outperforms x_CG+ and is even better than x_{k_opt}, as can be seen in Table 1.
These problems illustrate two different cases where the use of ACG can be worthwhile: in the first case (shaw) the error of CG+GCV is very close to the optimal error E_{k_opt}, but ACG obtains a substantial further reduction; in the second case (ilaplace) CG+GCV performs poorly because GCV does not provide a good stopping index, while ACG reaches an error lower than E_{k_opt}.
However, ACG does not always guarantee an improvement in the accuracy of the approximate solution with respect to CG+GCV. Considering the whole set of test problems, for each sample the quantity

(15) e = E_ACG / E_CG+ − 1

is used as an indicator of the performance of the algorithm. We judge that the two methods produce approximations of comparable accuracy when |e| ≤ ε, for ε = 0.02. The behavior of ACG on the two test sets, each consisting of 66 samples, for n = 64 and n = 256, is shown in Table 2.

Table 2. Analysis of the behavior of ACG compared with CG+GCV on the whole test set for n = 64 and n = 256.
The columns of the table give the percentages of cases for which ACG strongly outperforms CG+GCV (e < −10ε), outperforms CG+GCV (−10ε ≤ e < −ε), is comparable to CG+GCV (|e| ≤ ε), is outperformed by CG+GCV (ε < e ≤ 10ε), and is strongly outperformed by CG+GCV (10ε < e). These results show that, in a high percentage of cases, ACG produces approximations more accurate than CG+GCV, even if at a greater cost. More precisely, the analysis of the computational costs of the two algorithms, in terms of the number of iterations, shows that the average cost of ACG on the two test sets is four to five times that of the usual CG+GCV method. Such a multiplicative factor in the cost of ACG is not so heavy when compared to the number of trials required by the most popular methods for the detection of the Tikhonov regularization parameter.

7. Conclusions. Regularization problems are intrinsically difficult, because all the heuristics suggested for estimating the optimal regularization parameter may sometimes turn out to be unreliable. Our proposal combines the advantages of using an iterative method (essential for large-dimensional problems) with a less critical detection of the stopping point.