Risk Estimators for Choosing Regularization Parameters in Ill-Posed Problems - Properties and Limitations


Abstract. This paper discusses the properties of certain risk estimators that recently regained popularity for choosing regularization parameters in ill-posed problems, in particular for sparsity regularization. They apply Stein's unbiased risk estimator (SURE) to estimate the risk in either the space of the unknown variables or in the data space. We will call the latter PSURE in order to distinguish the two different risk functions. It seems intuitive that SURE is more appropriate for ill-posed problems, since the properties in the data space do not tell much about the quality of the reconstruction. We provide theoretical studies of both approaches for linear Tikhonov regularization in a finite dimensional setting and estimate the quality of the risk estimators, which also leads to asymptotic convergence results as the dimension of the problem tends to infinity. Unlike previous works which studied single realizations of image processing problems with a very low degree of ill-posedness, we are interested in the statistical behaviour of the risk estimators for increasing ill-posedness. Interestingly, our theoretical results indicate that the quality of the SURE risk can deteriorate asymptotically for ill-posed problems, which is confirmed by an extensive numerical study. The latter shows that in many cases the SURE estimator leads to extremely small regularization parameters, which obviously cannot stabilize the reconstruction. Similar but less severe issues with respect to robustness also appear for the PSURE estimator, which in comparison to the rather conservative discrepancy principle leads to the conclusion that regularization parameter choice based on unbiased risk estimation is not a reliable procedure for ill-posed problems. A similar numerical study for sparsity regularization demonstrates that the same issue appears in non-linear variational regularization approaches.

1. Introduction. Choosing suitable regularization parameters is a problem as old as regularization theory, which has seen a variety of approaches from deterministic (e.g. L-curve criteria, [23,22]) and statistical perspectives (e.g. Lepskij principles, [3,26]), or in between (e.g. discrepancy principles motivated by deterministic bounds or noise variance, cf. [38,4]). While the particular class of statistical parameter choice rules based on unbiased risk estimation (URE) was used for linear inverse reconstruction techniques early on [35,37,17], there is a renewed interest in these approaches for iterative, non-linear inverse reconstruction techniques, in particular in the context of sparsity constraints, see e.g., [33,15,42,19,30,43,12,13,14,40,44,11,39]. These works are based on extending Stein's general construction of an unbiased risk estimator [36] to the inverse problems setting. Compared to approaches that measure the risk in the data space, the classical SURE and a generalized version (GSURE, [15,19,13,40]) measure the risk in the space of the unknown, which seems more appropriate for ill-posed problems. Previous investigations show that the performance of such parameter choice rules is reasonable in many different settings (cf. [21,45,9,2,34,31,15]). However, most of the problems considered in these works are very mildly ill-posed (which we will define more precisely below), the interplay between ill-posedness and the performance of the risk estimators is not studied explicitly, and the inherent statistical nature of the selected regularization parameter is ignored, as only single realizations of noise are typically considered.
Therefore, a first motivation of this paper is to further study the properties of SURE in Tikhonov-type regularization methods from a statistical perspective and systematically as a function of the ill-posedness of the problem. For this purpose we provide a theoretical analysis of the quality of unbiased risk estimators in the case of linear Tikhonov regularization. In addition, we carry out extensive numerical investigations on appropriate model problems. While in very mildly ill-posed settings the performances of the parameter choice rules under consideration are reasonable and comparable, our investigations yield various interesting results and insights in ill-posed settings. For instance, we demonstrate that SURE shows a rather erratic behaviour as the degree of ill-posedness increases. The observed effects are so strong that the meaning of a parameter chosen according to this particular criterion is unclear.
A second motivation of this paper is to study the discrepancy principle as a reference method; as we shall see, it can indeed be put in a very similar context and analysed by the same techniques. Although the popularity of the discrepancy principle has recently been decreasing in favour of choices using more statistical details, our findings show that it is still more robust for ill-posed problems than risk-based parameter choices. The conservative choice by the discrepancy principle is well-known to rather overestimate the optimal parameter, but on the other hand it avoids choosing too small a regularization parameter, as risk-based methods often do. In the latter case the reconstruction results are completely deteriorated, while the discrepancy principle yields a reliable, though not optimal, reconstruction.
Formal Introduction. We consider a discrete inverse problem of the form

y = Ax* + ε, (1)

where y ∈ R^m is a vector of observations, A ∈ R^{m×n} is a known matrix, and ε ∈ R^m is a noise vector. We assume that ε consists of independent and identically distributed (i.i.d.) Gaussian errors, i.e., ε ∼ N(0, σ²I_m). The vector x* ∈ R^n denotes the (unknown) exact solution to be reconstructed from the observations. There are two potential difficulties for this: If A has a non-trivial kernel, e.g. for n > m, we simply cannot observe certain aspects of x* and regularization has to interpolate them from the observed features in some way. This is, however, not the ill-posedness we are interested in here. In practice, we often know what we miss, i.e., the structure of the kernel, and we consider these problems only "mildly ill-posed". The second difficulty is more subtle: The singular values of A might decay very fast, which means that certain aspects of x* are barely measurable and even small additional noise ε can render their recovery unstable. This is the main difficulty we are interested in here, so we will measure the degree of ill-posedness of (1) by the condition number of A restricted to its co-kernel, i.e. the ratio between the largest and smallest non-zero singular values. Note that this definition deviates from the classical definition of ill-posedness for continuous problems by Hadamard [20], which leads to a binary classification of problems as either well- or ill-posed but is not very useful for practical applications. In order to find an estimate x̂(y) of x* from (1), we apply a variational regularization method:

x̂_α(y) := argmin_x { ½ ||Ax − y||₂² + α R(x) }, (2)

where R is assumed convex and such that the minimizer is unique for positive regularization parameter α > 0. In what follows the dependence of x̂_α(y) on α and the data y may be dropped where it is clear without ambiguity that x̂ = x̂_α(y).
In practice there are two choices to be made: First, a regularization functional R needs to be specified in order to appropriately represent a-priori knowledge about solutions, and second, a regularization parameter α needs to be chosen in dependence of the data y. The ideal parameter choice would minimize a difference between x̂_α(y) and x* over all α, which obviously cannot be computed and is hence replaced by a parameter choice rule that tries to minimize a worst-case or average error to the unknown solution, which can be referred to as a risk minimization. In the practical case of having a single observation only, the risk based on the average error needs to be replaced by an estimate as well, and unbiased risk estimators, which will be detailed in the following, are a natural choice.
For the sake of a clearer presentation of methods and results we first focus on linear Tikhonov regularization, i.e., R(x) = ½ ||x||₂², leading to the explicit Tikhonov estimator

x̂_α(y) = (A*A + αI_n)⁻¹ A*y. (3)

In this setting, a natural distance for measuring the error of x̂_α(y) is given by its ℓ²-distance to x*. Thus, we define

α* := argmin_α ||x̂_α(y) − x*||₂² (4)

as the optimal, but inaccessible, regularization parameter. Many different rules for the choice of the regularization parameter α are discussed in the literature.
Here, we focus on strategies that rely on an accurate estimate of the noise variance σ². A classical example of such a rule is given by the discrepancy principle: The regularization parameter α̂_DP is given as the solution of the equation

||A x̂_α(y) − y||₂² = mσ². (5)

The discrepancy principle is robust and easy to implement for many applications (cf. [5,24,32]) and is based on the heuristic argument that x̂_α(y) should only explain the data y up to the noise level. The broader class of unbiased risk estimators accounts for the stochastic nature of ε by aiming to choose α such that it minimizes certain ℓ²-errors between x̂_α(y) and x* only in expectation: We first define the mean squared prediction error (MSPE) as

MSPE(α) := E[ ||A x̂_α(y) − A x*||₂² ] (6)

and refer to its minimizer as α̂_MSPE. Since the MSPE depends on the unknown vector x*, we have to replace it by an unbiased estimate, which we will call PSURE here, and define

α̂_PSURE := argmin_α PSURE(α, y), PSURE(α, y) := ||A x̂_α(y) − y||₂² − mσ² + 2σ² df_α(y), (7)

with df_α(y) = tr (∇_y · A x̂_α(y)).
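For linear Tikhonov regularization these quantities have closed forms via the SVD. The sketch below (with our own helper names, e.g. `tikhonov_quantities`, not from the paper) computes the residual, df_α and PSURE accordingly:

```python
import numpy as np

def tikhonov_quantities(A, y, alpha):
    """Tikhonov estimate x_hat = (A^T A + alpha I)^{-1} A^T y via the SVD,
    together with the residual ||A x_hat - y||^2 and df_alpha."""
    U, g, Vt = np.linalg.svd(A, full_matrices=False)
    z = U.T @ y                            # spectral data
    x_hat = Vt.T @ (g * z / (g**2 + alpha))
    residual = np.sum((A @ x_hat - y)**2)
    df = np.sum(g**2 / (g**2 + alpha))     # df_alpha = tr(grad_y A x_hat)
    return x_hat, residual, df

def psure(A, y, alpha, sigma2):
    """Unbiased estimate of the prediction risk E||A x_hat - A x*||^2."""
    m = len(y)
    _, residual, df = tikhonov_quantities(A, y, alpha)
    return residual - m * sigma2 + 2.0 * sigma2 * df

def discrepancy(A, y, alpha, sigma2):
    """Discrepancy-principle residual; alpha_DP solves this = 0."""
    _, residual, _ = tikhonov_quantities(A, y, alpha)
    return residual - len(y) * sigma2
```

The DP parameter can then be obtained by a root finder (e.g. bisection) on `discrepancy`, and α̂_PSURE by minimizing `psure` over a fine logarithmic α-grid.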
While the classical SURE estimator would try to estimate the expectation of the simple ℓ²-error between x̂_α(y) and x* like in (4), a generalization [15,19] is often considered in inverse problems where A may have a non-trivial kernel: We define the mean squared estimation error (MSEE) here as

MSEE(α) := E[ ||Π(x̂_α(y) − x*)||₂² ], (8)

where Π := A⁺A denotes the orthogonal projector onto the range of A* (M⁺ denotes the pseudoinverse of M), and refer to the minimizer of MSEE(α) as α̂_MSEE. Again, we replace MSEE by an unbiased estimator to obtain

α̂_SURE := argmin_α SURE(α, y), SURE(α, y) := ||x̂_α(y) − x_ML||₂² − σ² tr((AA*)⁺) + 2σ² gdf_α(y), (9)

where x_ML := A⁺y denotes the maximum likelihood estimate,

with gdf_α(y) = tr((AA*)⁺ ∇_y A x̂_α(y)). If A is non-singular, as it will be in the theoretical analysis and numerical experiments in this work, the above definition coincides with the classical one considered by Stein [36]. Note that the main difference between the two risk functions MSPE and MSEE and their corresponding estimators PSURE and SURE is that they measure the error in the image and in the domain of the ill-conditioned operator A, respectively. The second important observation here is that all parameter choice rules depend on the data y and hence on the random errors ε₁, ..., ε_m. Therefore, α̂_DP, α̂_PSURE and α̂_SURE are random variables, described in terms of their probability distributions. In the next section, we first investigate these distributions by a numerical simulation study in a simple inverse problem scenario using quadratic Tikhonov regularization. The results point to several problems of the presented parameter choice rules, in particular of SURE, and motivate our further theoretical investigation in Section 3. The theoretical results will be illustrated and supplemented by an exhaustive numerical study in Section 4. Finally we extend the numerical investigation in Section 5 to a sparsity-promoting LASSO-type regularization, for which we find a similar behaviour. Conclusions are given in Section 6.
2. Risk Estimators for Quadratic Regularization. In the following we discuss the setup in the case of the simple quadratic regularization functional R(x) = 1 2 x 2 , i.e. we recover the well-known linear Tikhonov regularization scheme. The linearity can be used to simplify arguments and gain analytical insight in the next section. While the arguments presented can easily be extended to more general quadratic regularizations, this model already contains all important properties.

2.1. Singular System and Risk Representations.
Considering a quadratic regularization allows to analyze x̂_α in a singular system of A in a convenient way. Let r = rank(A), q = min(n, m). Let

A = U Γ V*, Γ = diag(γ₁, ..., γ_q), γ₁ ≥ ... ≥ γ_r > γ_{r+1} = ... = γ_q = 0,

with U ∈ R^{m×m} and V ∈ R^{n×n} unitary. Defining z := U*y, x̃* := V*x* and ε̃ := U*ε, we can rewrite model (1) in its spectral form

z_i = γ_i x*_i + ε̃_i, i = 1, ..., m,

where ε̃₁, ..., ε̃_m are still i.i.d. ∼ N(0, σ²). All quantities considered in the following depend on n or m. In particular, we have A = A_{n,m}, y = y_m, x* = x*_n, γ_i = γ_{i,n,m}, x*_i = x*_{i,n,m} and ε̃_i = ε̃_{i,n,m}. This dependence is made explicit in the statements of the results and technical assumptions for clarity but is dropped in the main text for ease of notation. Increasing m corresponds to sampling from an equation such as (1) more finely, whereas an increase in n increases the level of discretization of an operator A_∞ (see section 2.2). In our asymptotic considerations both n and m tend to infinity.

We will express some more terms in the singular system that are frequently used throughout this paper. In particular, we have for x_ML = A⁺y, the regularized solution x̂_α (dropping the dependence on y below for notational simplicity) and its norm

(x_ML)_i = z_i/γ_i, (x̂_α)_i = γ_i z_i/(γ_i² + α), ||x̂_α||₂² = Σ_{i=1}^r (γ_i z_i/(γ_i² + α))², (i = 1, ..., r)

as well as the residual and distance to the maximum likelihood estimate

||A x̂_α − y||₂² = Σ_{i=1}^r (α/(γ_i² + α))² z_i² + Σ_{i=r+1}^m z_i², ||Π(x̂_α − x_ML)||₂² = Σ_{i=1}^r (α z_i/(γ_i(γ_i² + α)))². (13)

Based on the generalized inverse we compute ∇_y A x̂_α = U diag(γ_i²/(γ_i² + α)) U*, which yields the degrees of freedom

df_α = Σ_{i=1}^r γ_i²/(γ_i² + α)

and the generalized degrees of freedom

gdf_α = Σ_{i=1}^r 1/(γ_i² + α).

Next, we derive the spectral representations of the parameter choice rules. For the discrepancy principle, we use (13) to define

DP(α, y) := Σ_{i=1}^r (α/(γ_i² + α))² z_i² + Σ_{i=r+1}^m z_i² − mσ², (14)

and now, (5) can be restated as DP(α̂_DP, y) = 0. For (7) and (9), we find

PSURE(α, y) = Σ_{i=1}^r (α/(γ_i² + α))² z_i² + Σ_{i=r+1}^m z_i² − mσ² + 2σ² Σ_{i=1}^r γ_i²/(γ_i² + α), (15)

SURE(α, y) = Σ_{i=1}^r (α z_i/(γ_i(γ_i² + α)))² − σ² Σ_{i=1}^r 1/γ_i² + 2σ² Σ_{i=1}^r 1/(γ_i² + α). (16)
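The trace definitions of df_α and gdf_α and their spectral sums can be cross-checked numerically; a small sketch with a generic full-rank matrix (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 6, 0.3
A = rng.standard_normal((m, m))        # generic full-rank matrix

# Jacobian of y -> A x_hat(y) for Tikhonov: A (A^T A + alpha I)^{-1} A^T
J = A @ np.linalg.solve(A.T @ A + alpha * np.eye(m), A.T)

df_trace = np.trace(J)                              # tr(grad_y A x_hat)
gdf_trace = np.trace(np.linalg.inv(A @ A.T) @ J)    # tr((AA^T)^+ grad_y A x_hat)

g = np.linalg.svd(A, compute_uv=False)
df_spectral = np.sum(g**2 / (g**2 + alpha))         # sum gamma_i^2/(gamma_i^2 + alpha)
gdf_spectral = np.sum(1.0 / (g**2 + alpha))         # sum 1/(gamma_i^2 + alpha)
```

Both pairs agree up to floating-point error, confirming the spectral sums.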

2.2. An Illustrative Example. We consider a simple imaging scenario which exhibits typical properties of inverse problems. The unknown function x*_∞ : [−1/2, 1/2] → R is mapped to a function y_∞ : [−1/2, 1/2] → R by a periodic convolution with a compactly supported kernel of width l ≤ 1/2:

y_∞(t) = (A_{∞,l} x*_∞)(t) = ∫_{−1/2}^{1/2} k_l(t − s) x*_∞(s) ds, with ∫ k_l(t) dt = 1 and k_l(t) = 0 for |t| > l,

and continued periodically for |t| > 1/2. Examples of k_l(t) are plotted in Figure 1(a). The normalization ensures that A_{∞,l} and suitable discretizations thereof have the spectral radius γ₁ = 1, which simplifies our derivations and the corresponding illustrations. The x*_∞ used in the numerical examples is the sum of four delta distributions; their locations approximate [−0.3, −0.2, 0.1, 0.3] by irrational numbers, which will simplify the discretization of this continuous problem.
Discretization. For a given number n ∈ N, let E^n_i := [−1/2 + (i−1)/n, −1/2 + i/n), i = 1, ..., n, denote the equidistant partition of [−1/2, 1/2] and ψ^n_i(t) = √n 1_{E^n_i}(t) an orthonormal basis (ONB) of piecewise constant functions over that partition. If we use m and n degrees of freedom to discretize range and domain of A_{∞,l}, respectively, we arrive at the discrete inverse problem (1) with

A_{ij} = ⟨ψ^m_i, A_{∞,l} ψ^n_j⟩ = √(mn) ∫_{E^m_i} ∫_{E^n_j} k_l(t − s) ds dt. (17)

The two-dimensional integration in (17) is computed by the trapezoidal rule with equidistant spacing, employing 100 × 100 points to partition E^m_i × E^n_j. Note that we drop the subscript l from A_l whenever the dependence on this parameter is not of importance for the argument being carried out. As the convolution kernel k_l has mass 1 and the discretization was designed to be mass-preserving, we have γ₁ = 1 and the condition number of A is given by cond(A) = 1/γ_r, where r = rank(A). Figure 2 shows the decay of the singular values for various parameter settings and Table 1 lists the corresponding condition numbers: From this, we can see that the degree of ill-posedness of solving (1), measured in terms of the rate of decay of the singular values and the condition number, grows very fast with increasing m and l.
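A minimal version of this discretization can be sketched as follows; the cos²-bump (Hann) kernel and the midpoint quadrature are stand-ins for the paper's unspecified kernel and its 100 × 100 trapezoidal rule, so the numbers differ, but the qualitative behaviour (mass-1 rows, spectral radius ≈ 1, condition number growing with l) is the same:

```python
import numpy as np

def convolution_matrix(n, l):
    """Midpoint-rule discretization of a periodic convolution with a mass-1
    kernel of width l. The cos^2 (Hann) bump is a stand-in kernel."""
    k = lambda t: (np.cos(np.pi * t / (2 * l)) ** 2 / l) * (np.abs(t) <= l)
    t = -0.5 + (np.arange(n) + 0.5) / n        # cell midpoints in [-1/2, 1/2)
    d = t[:, None] - t[None, :]
    d = (d + 0.5) % 1.0 - 0.5                  # periodic wrap of the distance
    return k(d) / n                            # quadrature weight 1/n

A1 = convolution_matrix(32, 0.04)              # narrow kernel: mildly ill-posed
A2 = convolution_matrix(32, 0.12)              # wide kernel: more smoothing
s1 = np.linalg.svd(A1, compute_uv=False)[0]    # spectral radius, close to 1
s2 = np.linalg.svd(A2, compute_uv=False)[0]
c1 = np.linalg.cond(A1)
c2 = np.linalg.cond(A2)                        # grows with the kernel width l
```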
It is easy to show that in the infinite dimensional setting, the decay of the singular values would be exponential.
Distributions. Using the above formulas and m = n = 64, l = 0.06, σ = 0.1, we computed the empirical distributions of the α values selected by the different parameter choice rules by evaluating (14), (15) and (16) on a fine logarithmical α-grid, i.e., log₁₀(α_i) was increased linearly between −40 and 40 with a step size of 0.01. We drew N_ε = 10⁶ samples of ε. The results are displayed in Figures 3 and 4: In both figures, we use a logarithmic scaling of the empirical probabilities wherein empirical probabilities of 0 have been set to 1/(2N_ε). While this presentation complicates the comparison of the distributions as the probability mass is deformed, it facilitates the examination of small values and tails. First, we observe in Figure 3(a) that α̂_DP typically overestimates the optimal α*. However, it performs robustly and does not cause large ℓ²-errors, as can be seen in Figure 3(b). For α̂_PSURE and α̂_SURE, the latter is not true: While being closer to α* than α̂_DP most often, and, as can be seen from the joint error histograms in Figure 4, producing smaller ℓ²-errors more often (87%/56% of the time for PSURE/SURE), both distributions show outliers, i.e., occasionally, very small values of α̂ are estimated that cause large ℓ²-errors. In the case of α̂_SURE, we even observe two clearly separated modes in the distributions. Table 2 shows different statistics that summarize the described phenomena.

Figure 2. Decay of the singular values γ_i of A_l for different choices of m = n and l. As expected, increasing the width l of the convolution kernel leads to a faster decay. For a fixed l, increasing m corresponds to using a finer discretization and γ_i converges to the corresponding singular value of A_{∞,l}, as can be seen for the largest γ_i, e.g., for l = 0.02.

Table 2. Statistics of the ℓ²-error ||x* − x̂_α̂||₂ for different parameter choice rules using m = n = 64, l = 0.06, σ = 0.1 and N_ε = 10⁶ samples of ε.
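A scaled-down version of this simulation study can be sketched as follows; the diagonal toy spectrum γ_i = 0.9^i, the grid bounds and the sample size are our illustrative choices, not the convolution setup of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32
gamma = 0.9 ** np.arange(m)             # toy decaying spectrum (stand-in)
x_star = np.ones(m)
sigma = 0.1
alphas = np.logspace(-8, 4, 400)        # fine logarithmic alpha-grid

def select(z):
    """Grid-based alpha_DP and alpha_PSURE for spectral data z = gamma*x* + eps."""
    filt = gamma[:, None] ** 2 / (gamma[:, None] ** 2 + alphas[None, :])
    resid = np.sum(((1.0 - filt) * z[:, None]) ** 2, axis=0)  # ||A x_hat - y||^2
    psure = resid - m * sigma**2 + 2 * sigma**2 * np.sum(filt, axis=0)
    a_dp = alphas[np.argmin(np.abs(resid - m * sigma**2))]    # DP root on grid
    return a_dp, alphas[np.argmin(psure)]

draws = [select(gamma * x_star + sigma * rng.standard_normal(m))
         for _ in range(500)]
a_dp, a_ps = (np.array(v) for v in zip(*draws))
# histograms of log10(a_dp), log10(a_ps) then give the empirical distributions
```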
These findings motivate the theoretical examinations carried out in the following section.

Figure 4. Joint error histograms (the distributions in Figure 3(b) are the marginal distributions thereof). As in Figure 3(b), the logarithms of the probabilities are displayed (here in form of a color-coding) to facilitate the identification of smaller modes and tails. The red line at x = y divides the areas where one method performs better than the other: In (a), all samples falling into the area on the right of the red line correspond to a noise realization where the discrepancy principle leads to a smaller error than PSURE. The percentage of samples for which that is true is 13% for PSURE and 44% for SURE.

3. Properties of the Parameter Choice Rules for Quadratic Regularization. In this section we consider the theoretical (risk) properties of PSURE, SURE and the discrepancy principle. To allow for a concise and accessible presentation of the main results, all proofs are shifted to Appendix A. As we are investigating random quantities, convergence rates are given in terms of the stochastic order symbols O_P and o_P, which correspond to Landau's big O and small o notation, respectively, when convergence in probability is considered. Let us recall the definition of O_P and o_P using the formulation in [41], Chapter 2.1.
Definition 3.1 (Stochastic Order Symbols). Let (Ω, F, P) be a probability space, Z_n : Ω → R, n ∈ N, be a sequence of random variables, and (r_n)_{n∈N} be a sequence of positive numbers. We say that Z_n = O_P(r_n) if for every ε > 0 there exists a constant K > 0 such that sup_n P(|Z_n| > K r_n) < ε. We say that Z_n = o_P(r_n) if Z_n/r_n converges to 0 in probability, i.e., if P(|Z_n| > δ r_n) → 0 for every δ > 0.

Assumption 1. For the sake of simplicity we only consider m = n in this first analysis. Furthermore, we assume. Note that all assumptions are fulfilled in the numerical example we described in the previous section.
We mention that we consider here a rather moderate size of the noise, whose variance remains bounded as m → ∞. A scaling corresponding to white noise in the infinite dimensional limit would rather be σ² ∼ m, and an inspection of the estimates below shows that the risk estimate is potentially even farther from the expected values in such cases.
3.1. PSURE-Risk. We start with an investigation of the PSURE risk estimate. Based on (15) and Stein's result, the representation for the risk is given as

MSPE(α) = E[PSURE(α, y)] = Σ_{i=1}^m (α² γ_i² (x*_i)² + σ² γ_i⁴)/(γ_i² + α)².

Figure 5(b) illustrates the typical shape of MSPE(α) and PSURE estimates thereof. Following [29,25], who considered the case A = I_m, and [46,18], who investigated the performance of Stein's unbiased risk estimate in the different context of hierarchical modeling, we show that PSURE is uniformly close to the loss L defined by

L(α, y) := ||A(x̂_α(y) − x*)||₂².

Note that PSURE is an unbiased estimate of the expectation of L.
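The unbiasedness E[PSURE(α, y)] = MSPE(α) can be verified numerically in the spectral representation; the following toy check uses an assumed linear spectrum and unit exact solution (our choices, not the paper's model problem):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10
gamma = np.linspace(1.0, 0.1, m)       # assumed toy singular values
x = np.ones(m)                         # assumed exact spectral coefficients
sigma, alpha = 0.1, 0.05

# exact MSPE(alpha) from the spectral bias/variance decomposition
mspe = np.sum((alpha**2 * gamma**2 * x**2 + sigma**2 * gamma**4)
              / (gamma**2 + alpha)**2)

# Monte-Carlo average of PSURE(alpha, y) over independent noise draws
df = np.sum(gamma**2 / (gamma**2 + alpha))
vals = np.empty(4000)
for j in range(vals.size):
    z = gamma * x + sigma * rng.standard_normal(m)
    resid = np.sum((alpha * z / (gamma**2 + alpha)) ** 2)
    vals[j] = resid - m * sigma**2 + 2 * sigma**2 * df
```

The sample mean of `vals` matches `mspe` up to Monte-Carlo error.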
Theorem 3.2. If Assumption 1 holds, then we have for any sequence of vectors (x*_m)_{m∈N}:

sup_{α>0} m⁻¹ |PSURE_m(α, y) − L_m(α, y)| = O_P(m^{−1/2}).

Remark 1. The result of Theorem 3.2 guarantees stochastic boundedness of the sequence √m sup_{α>0} m⁻¹ |PSURE_m(α, y) − L_m(α, y)|. It does not entail the existence of a proper weak limit of PSURE, which would require stronger assumptions on the sequences (x*_m)_{m∈N} and ((γ_{i,m})_{i=1}^m)_{m∈N}. The latter result can be used to show that, in an asymptotic sense, if the loss L is considered, the estimator α̂_PSURE does not have a larger risk than any other choice of regularization parameter. This statement is made precise in the following corollary. We finally mention that our estimates are rather conservative, in particular with respect to the quantity Sl₃(α) in the proof of Theorem 3.2, since we do not assume particular smoothness of x*. With an additional source condition, i.e., a certain decay speed of the x*_i, it is possible to derive improved rates, which are however beyond the scope of our paper. We refer to [10] and [27] for recent results in that direction, where optimality of x̂_{α̂_PSURE} with respect to the risk MSEE under source conditions is shown for spectral cut-off and more general, filter-based methods, respectively.
We turn our attention to the convergence of the risk estimate as m → ∞ as well as the convergence of the estimated regularization parameters.
In order to understand the behaviour of the estimated regularization parameters we start with some bounds on α̂_MSPE, which recover a standard property of deterministic Tikhonov-type regularization methods, namely that σ²/α does not diverge for suitable parameter choices (cf. [16]).
The sequence (MSPE_m)_{m∈N} is equicontinuous on sets [C₁, C₂] with 0 < C₁ < C₂ and hence has a uniformly convergent subsequence f_{m_k} with continuous limit function f.
In order to obtain convergence of minimizers it suffices to be able to choose uniform constants C₁ and C₂, which is possible if the bounds in Lemma 3.4 are uniform. Moreover, MSPE_m(α) is bounded away from zero by the assumptions of Theorem 3.5. Therefore, asymptotically, minimizing MSPE_m is the same as minimizing PSURE.

3.2. Discrepancy Principle.
We now turn our attention to the discrepancy principle, which we can formulate in a similar setting as the PSURE approach above. With a slight abuse of notation, in analogy to the other methods, we denote the mean of DP(α, y) by MDP(α) (mean discrepancy) and define α̂_MDP as the solution of the equation MDP(α) = 0.

3.3. SURE-Risk. Now we consider the SURE risk estimation procedure. Figure 5(c) illustrates the typical shape of MSEE(α) and SURE estimates thereof. Based on (16), if γ_m > 0 for all m, the risk can be written as

MSEE(α) = E[SURE(α, y)] = Σ_{i=1}^m (α² (x*_i)² + σ² γ_i²)/(γ_i² + α)².

For the PSURE criterion we showed in Theorem 3.2 that PSURE(α, y) is close to the loss L in an asymptotic sense with the standard √m-rate of convergence. An analogous result can be shown for SURE and the associated loss L̃(α) := c_m ||Π(x* − x̂_α)||₂², but with different associated rates of convergence c_m, depending on the singular values.
In the same manner as for PSURE, we may use the latter convergence result to show that, in an asymptotic sense, if the loss L̃ is considered, the estimator α̂_SURE does not have a larger risk than any other choice of regularization parameter. We stress again that this optimality property depends on the loss considered, as is the case in Corollary 1. We can now proceed to an estimate between SURE and MSEE similar to the ones for the PSURE risk; however, we observe a main difference due to the appearance of the condition number of the forward matrix A:

Theorem 3.8. Let A_m ∈ R^{m×m} be a full rank matrix. In addition to Assumption 1, let γ_{m,m} > 0 for all m and γ_{m,m} → 0. Then, we have for any sequence of vectors (x*_m)_{m∈N}:

sup_{α>0} m⁻¹ |SURE_m(α, y) − L̃_m(α, y)| = O_P(cond(A_m)² m^{−1/2}).

We finally note that even in the best case the convergence of SURE is slower than that of PSURE. However, since for ill-posed problems the condition number of A will grow with m, the typical case is rather divergence of cond(A)²/√m; hence the empirical estimates of the regularization parameters might have a large variation, which will be confirmed by the numerical results below.

The suprema in (24) were computed over the α-grid. As in some cases the supremum is obtained in the limit α → ∞, and hence on the boundary of our computational grid, we also evaluated (24) for α = ∞ in these cases.

4.2. Illustration of Theorems. We first illustrate Theorems 3.3 and 3.8 by computing (22) and (23) based on our samples. The results are plotted in Figure 6 and show that the asymptotic rates hold. For SURE, the comparison between Figures 6(b) and 6(c) also shows that the dependence on cond(A) is crucial.

4.3. Dependence on the Ill-Posedness. We then demonstrate how the empirical distributions of α̂ and the corresponding ℓ²-error, ||x* − x̂_α̂||₂², such as those plotted in Figure 3, depend on the ill-posedness of the inverse problem.
Dependence on m. In Figures 7 and 8, m is increased while the width of the convolution kernel is kept fixed. The impact of this on the singular value spectrum is illustrated in Figure 2. Most notably, smaller singular values are added and the condition number of A increases (cf. Table 1). Figures 7(a) and 8(a) suggest that the distribution of the optimal α* is Gaussian and converges to a limit for increasing m.

Dependence on l. In Figures 9 and 10, the width of the convolution kernel, l, is increased while m = 64 is kept fixed (cf. Figure 2 and Table 1). It is worth noticing that as l = 0.02 corresponds to a very well-posed problem, the optimal α* is often extremely small or even 0, as can be seen from Figure 9.

4.4. Linear vs Logarithmical Grids. One reason why the properties of SURE exposed in this work have not been noticed so far is that they only become apparent in very ill-conditioned problems (cf. Section 1). Another reason is the way the risk estimators are typically computed: Firstly, for high-dimensional problems, (3) often needs to be solved by an iterative method. For very small α, the condition number of (A*A + αI) is very large and the solver will need many iterations to reach a given tolerance. If, instead, a fixed number of iterations is used, an additional regularization of the solution to (1) is introduced, which alters the risk function. Secondly, again due to the computational effort, a coarse, linear α-grid excluding α = 0 instead of a fine, logarithmic one is often used for evaluating the risk estimators. For two of the risk estimations plotted in Figure 5(c), Figure 11 demonstrates that this insufficient coverage of small α values by the grid can lead to missing the global minimum and to other misinterpretations.
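The two grid constructions compared in Figure 11 can be written down directly; the value α̂_DP = 0.1 below is an arbitrary stand-in:

```python
import numpy as np

alpha_dp = 0.1                        # stand-in value of the DP parameter
n_lin = 50

# coarse linear grid around alpha_DP: Delta, 2*Delta, ..., 50*Delta
delta = 2 * alpha_dp / n_lin
linear_grid = delta * np.arange(1, n_lin + 1)

# fine logarithmic grid, log10(alpha) from -10 to 1 in steps of 0.01
log_grid = 10.0 ** np.arange(-10, 1 + 0.01, 0.01)

smallest_linear = linear_grid.min()   # nothing below this value is ever probed
frac_below = np.mean(log_grid < smallest_linear)
```

Most of the logarithmic grid lies below the smallest point of the linear grid, which is exactly the region where the SURE minima of Figure 11 are located.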

5. Numerical Studies for Non-Quadratic Regularization. In this section, we consider the popular sparsity-inducing R(x) = ||x||₁ as a regularization functional (LASSO penalty) to examine whether our results also apply to non-quadratic regularization functionals. For this, let I be the support of x̂_α(y) and J its complement. Let further |I| = k, P_I ∈ R^{k×n} be the projector onto I and A_I the restriction of A to I. We have that df_α = ||x̂_α(y)||₀ = k and gdf_α = tr(Π B_[J]), B_[J] := P_I (A*_I A_I)⁻¹ P*_I, as shown, e.g., in [39,14,12], which allows us to compute PSURE (7) and SURE (9). Notice that while x̂_α(y) is a continuous function of α [7], PSURE and SURE are discontinuous at all α where the support I changes. To carry out numerical studies similar to those presented in the last section, we have to overcome several non-trivial difficulties: While various iterative optimization techniques exist to solve (2) nowadays (see, e.g., [8]), each method typically only works well for certain ranges of α, cond(A) and tolerance levels to which the problem should be solved. In addition, each method comes with internal parameters that have to be tuned for each problem separately to obtain fast convergence. As a result, it is difficult to compute a consistent series of x̂_α(y) for a given logarithmical α-grid, i.e., one that accurately reproduces all the change-points in the support and has a uniform accuracy over the grid. Our solution to this problem is to use an all-at-once implementation of ADMM [6] that solves (2) for the whole α-grid simultaneously, i.e., using exactly the same initialization, number of iterations and step sizes. See Appendix B for details. In addition, an extremely small tolerance level (tol = 10⁻¹⁴) and 10⁴ maximal iterations were used to ensure a high accuracy of the solutions.
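For an orthonormal A the minimizer of (2) with R(x) = ||x||₁ is explicit (componentwise soft-thresholding), which gives a minimal check of the df and gdf formulas above. The orthonormal shortcut and the helper names are our illustrative assumptions; the paper solves the general problem with ADMM:

```python
import numpy as np

def lasso_orthonormal(A, y, alpha):
    """Minimizer of 0.5*||Ax - y||^2 + alpha*||x||_1 when A^T A = I:
    componentwise soft-thresholding of A^T y."""
    c = A.T @ y
    return np.sign(c) * np.maximum(np.abs(c) - alpha, 0.0)

def lasso_dof(A, x_hat):
    """df = ||x_hat||_0 and gdf = tr(Pi B) for the support I of x_hat."""
    I = np.flatnonzero(x_hat)
    A_I = A[:, I]
    Pi = np.linalg.pinv(A) @ A                     # projector onto range(A^T)
    B = np.zeros((A.shape[1], A.shape[1]))
    B[np.ix_(I, I)] = np.linalg.inv(A_I.T @ A_I)   # (A_I^T A_I)^{-1} embedded on I
    return len(I), np.trace(Pi @ B)
```

Note that both df and gdf change abruptly whenever a component enters or leaves the support, which is the source of the discontinuities of PSURE and SURE mentioned above.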
Another problem for computing quantities like (24) is that we can no longer compute the expectations defining the real risks MSPE and MSEE: We have to estimate them as the sample mean over PSURE and SURE in a first run of the studies, before we can compute (24) in a second run (wherein MSPE and MSEE are replaced by the estimates from the first run). We considered scenarios with each combination of m = n = 16, 32, 64, 128, 256, 512, l = 0.02, 0.04, 0.06 and σ = 0.1. Depending on m, N_ε = 10⁵, 10⁴, 10⁴, 10⁴, 10³, 10³ noise realizations were examined. The computation was based on a logarithmical α-grid where log₁₀ α is increased linearly between −10 and 10 with a step size of 0.01.
Risk Plots. Figure 12 shows the different risk functions and estimates thereof. The jagged form of the PSURE and SURE plots evaluated on this fine α-grid indicates that the underlying functions are discontinuous. Also note that while PSURE and SURE for each individual noise realization are discontinuous, MSPE and MSEE are smooth and continuous, as can already be seen from the empirical means over N_ε = 10⁴ realizations.
Empirical Distributions. Figure 13 shows the empirical distributions of the different parameter choice rules for α. Here, the optimal α* is chosen as the one minimizing the ℓ¹-error ||x* − x̂_α||₁ to the true solution x*. We can observe similar phenomena as for ℓ²-regularization. In particular, the distributions for SURE also have multiple modes at small values of α and at large values of the ℓ¹-error.

Figure 11. Illustration of the difference between evaluating the SURE risk on a coarse, linear grid for α as opposed to a fine, logarithmic one: In (a), a linear grid is constructed around α̂_DP as α = Δ_α, 2Δ_α, ..., 50Δ_α with Δ_α = 2α̂_DP/50. While the plot suggests a clear minimum, (b) reveals that it is only a sub-optimal local minimum and that the linear grid did not cover the essential parts of SURE(α, y). (c) and (d) show the same plots for a different noise realization. Here, a linear grid will not even find a clear minimum. Both risk estimators are the same as those plotted in Figure 5(c) with the same colors.
Sup-Theorems. Due to the lack of explicit formulas for the ℓ¹-regularized solution x_α(y), carrying out an analysis similar to that of Section 3 to derive results such as Theorems 3.3 and 3.8 is very challenging. In this work, we only illustrate that similar results may hold for ℓ¹-regularization by computing the left-hand sides of (22) and (23) from our samples. The results are shown in Figure 14 and are remarkably similar to those shown in Figure 6.
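The sample-based approximation of these left-hand sides follows the two-run procedure described earlier: the unknown true risk is first estimated by the sample mean over all realizations, and the expected sup-deviation is then a mean of per-realization maxima. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def sup_deviation(est_curves):
    """Monte Carlo approximation of E[ sup_alpha |estimate(alpha, y) - risk(alpha)| ].
    est_curves has shape (n_realizations, n_alphas), one risk-estimator curve
    per noise realization; the unknown true risk is replaced by the sample
    mean over all realizations (the 'first run')."""
    risk = est_curves.mean(axis=0)                       # first run: risk estimate
    return np.abs(est_curves - risk).max(axis=1).mean()  # second run: sup, then mean
```

Applying this to the PSURE and SURE curves collected over all noise realizations gives the sample versions of (22) and (23) plotted in Figure 14.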
Linear Grids and Accurate Optimization. All the issues raised in Section 4.4 about why the properties of SURE revealed in this work are likely to be overlooked when working on high-dimensional problems are even more critical in the case of ℓ¹-regularization: for computational reasons, the risk estimators are often evaluated on a coarse, linear α-grid using a small, fixed number of iterations of an iterative method such as ADMM. Figure 15 illustrates that this may obscure important features of the real SURE function, such as the strong discontinuities for small α, or even change it significantly.

6. Conclusion. We examined variational regularization methods for ill-posed inverse problems and conducted extensive numerical studies assessing the statistical properties of different parameter choice rules. In particular, we were interested in the influence of the degree of ill-posedness of the problem (measured in terms of the condition number of the forward operator) on the probability distributions of the selected regularization parameters and of the corresponding induced errors. This perspective revealed important features that have not been discussed or noticed before but are essential to know for practical applications, namely that unbiased risk estimators encounter enormous difficulties: while the discrepancy principle yields a rather unimodal distribution of regularization parameters resembling the optimal one with a slightly increased mean value, the PSURE estimates start to develop multimodality, and the additional modes consist of underestimated regularization parameters, which may lead to significant errors in the reconstruction. For SURE, which is based on a presumably more reliable risk, the estimates produce quite wide distributions (at least in logarithmic scaling) for increasing ill-posedness; in particular, there are many severely underestimated parameters, which clearly yield bad reconstructions.
We expect that this behaviour is due to the poor quality of the risk estimators rather than the quality of the risk itself. These findings may be explained by Theorem 3.8, which indicates that the estimated SURE risk might deviate strongly from the true risk function MSEE when the condition number of A is large, i.e. when the problem becomes asymptotically ill-posed as m → ∞. Consequently, one might expect a strong variation in the minimizers of SURE with varying y compared to those of MSEE. A potential way to cure these issues is to develop novel risk estimators for MSEE that are not based on Stein's method; it might even be useful not to insist on the unbiasedness of the estimators. We finally mention that for problems like sparsity-promoting regularization, the SURE risk leads to additional issues, since it is based on a Euclidean norm. While the discrepancy principle and the PSURE risk only use the norms appearing naturally in the output space of the inverse problem (or, in a more general setting, the log-likelihood of the noise), the Euclidean norm in the space of the unknown is rather arbitrary. In particular, it may deviate strongly from the Banach space geometry of ℓ¹ or similar spaces in high dimensions. Thus, different constructions of SURE risks, e.g. based on Bregman distances, should be considered in such a setting.

Figure 15. Illustration of the difficulties of evaluating the SURE risk in the case of ℓ¹-regularization: In (a), a coarse linear grid is constructed around α_DP as α = Δ_α, 2Δ_α, ..., 20Δ_α with Δ_α = α_DP/10. Similar to Figure 11(a), the plot suggests a clear minimum. However, a fine, logarithmic grid (b) reveals that it is only a sub-optimal local minimum before a very erratic part of SURE(α, y) starts. (c) shows how a coarse α-grid can lead to an arbitrary projection of SURE(α, y) that is likely to miss important features. Both risk estimators are the same as those plotted in Figure 12(c), with the same colors. In (d), the difference between computing SURE(α, y) with the consistent and highly accurate version of ADMM (Impl A) and with a standard ADMM version using only 20 iterations (Impl B) is illustrated.
Appendix A. Proofs.
for all s, t ∈ T, then Consider the process Obviously, Sl 3 is almost surely bounded and which yields where we used that 4 Setting ν = 1/2 above, the claim now follows.

It remains to show the L²-convergence (22). To this end, define the j-th partial sum S_j := Σ_{i=1}^{j} ε_i and observe that {S_j | j ∈ N} forms a martingale. The L^p-maximal inequality for martingales then yields the claim as above.
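For reference, the L^p-maximal inequality invoked here is Doob's martingale inequality; for a martingale $(S_j)_{j \in \mathbb{N}}$ and $p > 1$ it states
```latex
\mathbb{E}\Big[\max_{1 \le j \le n} |S_j|^p\Big]
\;\le\;
\Big(\frac{p}{p-1}\Big)^{p}\,\mathbb{E}\big[|S_n|^{p}\big],
\qquad p > 1,
```
which for $p = 2$ gives the constant $\left(\tfrac{2}{2-1}\right)^2 = 4$.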
Proof of Lemma 3.4. It is straightforward to verify the differentiability of MSPE and to compute its derivative. Hence, for α < σ²/max_i |x*_i|², the risk MSPE is strictly decreasing, which implies the first inequality. Moreover, for α ≥ 1 we obtain and we finally see that the derivative of MSPE is nonnegative if, in addition, α ≥ 8σ².

Proof of Theorem 3.5. From the uniform convergence of the sequence f_{m_k} in Proposition 1 we obtain the convergence of the minimizers α̂_{MSPE,m_k}. Combined with Theorem 3.3, an analogous argument yields the convergence of α̂_{PSURE,m_k}.