Tikhonov regularization with oversmoothing penalty for nonlinear statistical inverse problems

In this paper, we consider the nonlinear ill-posed inverse problem with noisy data in the statistical learning setting. The Tikhonov regularization scheme in Hilbert scales is considered to reconstruct the estimator from the random noisy data. In this statistical learning setting, we derive the rates of convergence for the regularized solution under certain assumptions on the nonlinear forward operator and the prior assumptions. We discuss estimates of the reconstruction error using the approach of reproducing kernel Hilbert spaces.


Introduction
We consider the nonlinear ill-posed operator equation A(f) = g with a nonlinear forward operator A : H → H′ between the infinite-dimensional Hilbert spaces H and H′. Moreover, H′ is a space of functions g : X → Y for a Polish space X (the input space) and a real separable Hilbert space Y (the output space). Ill-posed inverse problems have important applications in science and technology (see, e.g., [13,15,29,31]).
In the classical inverse problem setting, we observe an approximation $g^\delta$ of the function g with $\|g - g^\delta\|_{H'} \le \delta$ for some known noise level δ, and then reconstruct an estimator of the quantity f through regularization schemes. Here we consider the problem in the statistical learning setting, in which we observe the random noisy images $y_i$ at the points $x_i$. The problem can be described as follows:
(1) $y_i = A(f)(x_i) + \varepsilon_i, \quad i = 1, \ldots, m,$
where $\varepsilon_i$ is the random observational noise and m is called the sample size. The model (1) covers nonparametric regression under random design (which we also call the direct problem, i.e., A = I) and the linear statistical inverse learning problem. Thus, introducing a general nonlinear operator A gives a unified approach to the different learning problems.
Suppose the random observations are drawn independently and identically according to the joint probability measure ρ on the sample space Z = X × Y, and that ρ can be factorized as ρ(x, y) = ρ(y|x)ν(x), where ρ(y|x) is the conditional probability distribution of y given x and ν(x) is the marginal probability distribution on X.
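The sampling model (1) can be illustrated with a short simulation. The following is a minimal sketch, not the setting analyzed in the paper: the forward map `forward`, the uniform marginal ν on [0, 1), the Gaussian noise, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random

def forward(f):
    """Hypothetical nonlinear forward map A: parameters (a, b) -> function on [0, 1]."""
    a, b = f
    return lambda x: math.exp(a * x) + b * x * x

def draw_sample(f, m, noise_std=0.1, seed=0):
    """Draw m i.i.d. pairs (x_i, y_i) with y_i = A(f)(x_i) + eps_i."""
    rng = random.Random(seed)
    g = forward(f)
    sample = []
    for _ in range(m):
        x = rng.random()                 # x_i ~ nu, assumed Uniform[0, 1)
        eps = rng.gauss(0.0, noise_std)  # centered noise eps_i (assumed Gaussian)
        sample.append((x, g(x) + eps))   # y_i drawn from rho(. | x_i)
    return sample

# A sample z = {(x_i, y_i)} of size m = 200 for the illustrative parameters (0.5, -1.0).
z = draw_sample((0.5, -1.0), m=200)
```

Only the pairs in `z` are available to the estimator; neither ν nor the noise distribution is used in the reconstruction.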
For the statistical inverse problem (1), the goodness of an estimator f can be measured through the expected risk:
$\mathcal{E}(f) = \int_Z \|A(f)(x) - y\|_Y^2 \, d\rho(x, y).$
Further, we assume that $\int_Y \|y\|_Y^2 \, d\rho(y|x) < \infty$ for any x ∈ X. Then for the function
$g_\rho(x) = \int_Y y \, d\rho(y|x),$
the expected risk can be expressed as follows:
(3) $\mathcal{E}(f) = \int_X \|A(f)(x) - g_\rho(x)\|_Y^2 \, d\nu(x) + \int_Z \|g_\rho(x) - y\|_Y^2 \, d\rho(x, y).$
Hence finding the minimizer of the expected risk is equivalent to finding the minimizer of the quantity $\int_X \|A(f)(x) - g_\rho(x)\|_Y^2 \, d\nu(x)$. Since the probability measure ρ is unknown, the only information about it is available through the sample. Therefore we use regularization methods to stably reconstruct an estimator of the quantity f.
Tikhonov regularization is widely considered in both classical inverse problems and statistical learning theory. We consider Tikhonov regularization in Hilbert scales, which consists of an error term measuring the fit to the data and an oversmoothing penalty. We introduce an unbounded, closed, linear, self-adjoint, strictly positive operator L : D(L) ⊂ H → H with a dense domain of definition D(L) ⊂ H to treat an oversmoothing penalty in terms of a Hilbert scale. For some ℓ > 0, the operator L satisfies:
(4) $\|Lf\|_H \ge \ell \, \|f\|_H, \quad f \in D(L).$
For a given sample $z = \{(x_i, y_i)\}_{i=1}^m$, we define the Tikhonov regularization scheme in Hilbert scales:
(5) $f_{z,\lambda} = \operatorname{argmin}_{f \in D(A) \cap D(L)} \left\{ \frac{1}{m} \sum_{i=1}^m \|A(f)(x_i) - y_i\|_Y^2 + \lambda \|L(f - \bar f)\|_H^2 \right\}.$
Here $\bar f \in D(A) \cap D(L)$ denotes some initial guess of the true solution, which offers the possibility to incorporate a priori information, and λ is a positive regularization parameter which controls the trade-off between the error term and the complexity of the solution.
In many practical problems, the operator L, which influences the properties of the regularized approximation, is chosen to be a differential operator in some appropriate function space, e.g., the space of square-integrable functions $L^2(X, \nu; Y)$. It is well known that standard Tikhonov regularization suffers from the saturation effect. This finite qualification of Tikhonov regularization can be overcome using Hilbert scales. The problem (5) is non-convex, so a minimizer may not exist in general. For a continuous and weakly sequentially closed operator A, there exists a global minimizer of the functional in (5), but it is not necessarily unique since A is nonlinear (see [29, Section 4.1.1]).
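A finite-dimensional toy version of a Tikhonov functional with a weighted penalty can make the roles of the data-fit term, the operator L, the initial guess and λ concrete. The sketch below is only an illustration under strong assumptions: the two-parameter forward map `forward`, the diagonal stand-in `L_diag` for L, and the plain gradient descent with backtracking are all hypothetical choices, and because the functional is non-convex the routine returns at best a local minimizer.

```python
import math
import random

def forward(f):
    """Hypothetical nonlinear forward map A (an assumption for illustration)."""
    a, b = f
    return lambda x: math.exp(a * x) + b * x * x

def tikhonov_functional(f, z, lam, f_bar, L_diag):
    """(1/m) sum_i |A(f)(x_i) - y_i|^2 + lam * |L (f - f_bar)|^2 with diagonal L."""
    g = forward(f)
    data_fit = sum((g(x) - y) ** 2 for x, y in z) / len(z)
    penalty = sum((l * (fi - fb)) ** 2 for l, fi, fb in zip(L_diag, f, f_bar))
    return data_fit + lam * penalty

def minimize(z, lam, f_bar, L_diag, steps=200, h=1e-6):
    """Gradient descent with backtracking, started at f_bar (local minimizer only)."""
    J = lambda f: tikhonov_functional(f, z, lam, f_bar, L_diag)
    f = list(f_bar)
    for _ in range(steps):
        J0 = J(f)
        grad = []
        for k in range(len(f)):          # central finite differences
            fp, fm = list(f), list(f)
            fp[k] += h
            fm[k] -= h
            grad.append((J(fp) - J(fm)) / (2.0 * h))
        step = 1.0
        while step > 1e-12:              # accept only steps that decrease J
            cand = [fk - step * gk for fk, gk in zip(f, grad)]
            if J(cand) < J0:
                f = cand
                break
            step *= 0.5
    return f

rng = random.Random(1)
g_true = forward((0.5, -1.0))
xs = [rng.random() for _ in range(100)]
z = [(x, g_true(x) + rng.gauss(0.0, 0.05)) for x in xs]

f_bar = (0.0, 0.0)     # initial guess
L_diag = (1.0, 4.0)    # eigenvalues of the stand-in for L, bounded below by 1
f_hat = minimize(z, lam=1e-3, f_bar=f_bar, L_diag=L_diag)
```

Increasing `lam` pulls `f_hat` towards `f_bar`, while decreasing it lets the data-fit term dominate; this is the trade-off controlled by the regularization parameter.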
Generally, in the classical inverse problem literature (see [4,13,17,29] and references therein), 2-step approaches are considered: first an estimator $g^\delta$ of the function g is constructed from the observations $\{(x_i, y_i)\}_{i=1}^m$, and then the quantity f is estimated stably using various regularization schemes. Here we estimate the quantity f in a 1-step method using the Tikhonov regularization scheme (5) in the statistical learning setting. Now we review the work in the literature related to the considered problem. Regularization schemes in Hilbert scales are widely considered in classical inverse problems (with deterministic noise) [12,16,22,24,25,26,30]. In contrast, inverse problems with random observations are not well studied. Linear statistical inverse problems are studied in [11] under the assumption that the marginal probability measure ν is known, which is unrealistic since the only available information comes through the input points $(x_1, \ldots, x_m)$. This problem is also discussed in [7] for the general random design with an unknown marginal probability measure.
In the nonlinear setup, the reference [27] established error estimates for generalized Tikhonov regularization for (1) using a linearization technique in a random design setting. In other work, the authors of [4] consider a 2-step approach, however, again under the assumption that the norm in $L^2(X, \nu; Y)$ is known. The references [3] and [17,32] consider, respectively, a Gauss-Newton algorithm and Tikhonov regularization for certain nonlinear inverse problems, but also in the idealized setting of Hilbertian white or colored noise with known covariance, which can only cover sampling effects when $L^2(X, \nu; Y)$ is known. Loubes et al. [20] discussed the problem (1) under a fixed design and concentrated on the problem of model selection. Finally, the recent work [1] discussed the rates of convergence for Tikhonov regularization of the nonlinear inverse problem.
In contrast with the existing work [3,4,17,32], our results are improved in three respects:
• We do not restrict ourselves to Hilbertian white or colored noise.
• We consider a 1-step approach rather than the existing 2-step approaches for nonlinear inverse problems.
• The considered approach does not suffer from the saturation effect of standard Tikhonov regularization.
Following the work [1,7], we develop the error analysis for the Tikhonov regularization scheme for nonlinear inverse problems in Hilbert scales in the statistical learning setting. We establish error bounds for the statistical inverse problem in the reproducing kernel approach. We discuss the rates of convergence for Tikhonov regularization under certain assumptions on the nonlinear forward operator and the prior assumptions.
Some structural assumptions on the nonlinear mapping A are required to establish the convergence analysis. We consider conditions that are widely assumed in the literature on classical inverse problems, first assumed in [17] and presented in detail in the monograph [29]. We assume that the operator A is Fréchet differentiable at the true solution, and that the Fréchet derivative is Lipschitz continuous and satisfies the link condition (for a precise statement see Assumption 4).
The goal is to analyze the theoretical properties of the Tikhonov estimator $f_{z,\lambda}$; in particular, the performance of the regularization scheme is evaluated by error estimates for $f_{z,\lambda}$ in the reproducing kernel approach. Precisely, we develop a non-asymptotic analysis of Tikhonov regularization (5) for the nonlinear statistical inverse problem based on the tools that have been developed for the modern mathematical study of reproducing kernel methods. The challenges specific to the studied problem are that the considered model is an inverse problem (rather than a pure prediction problem) and that it is nonlinear. The rate of convergence of the Tikhonov estimator $f_{z,\lambda}$ to the true solution is described in the probabilistic sense by exponential tail inequalities. For sample size m and confidence level 0 < η < 1, we establish bounds on the reconstruction error that hold with probability at least 1 − η and are governed by a rate function ε(m). Here the function m ↦ ε(m) is positive and decreasing, and describes the rate of convergence as m → ∞.
The paper is organized as follows. In Section 2, we discuss the basic definitions and assumptions required in our analysis. In Section 3, we discuss the bounds of the reconstruction error under certain assumptions on the (unknown) joint probability measure ρ and the (nonlinear) mapping A. In the Appendix, we present the probabilistic estimates and the preliminary results which provide the tools to obtain the error bounds in the reproducing kernel approach.

Notation and assumptions
In this section, we introduce some basic concepts, definitions, and notations required in our analysis.
2.1. Reproducing kernel Hilbert space and related operators. We start with the concept of reproducing kernel Hilbert spaces. Such a space is a subspace of $L^2(X, \nu; Y)$ (the space of square-integrable functions from X to Y with respect to the probability distribution ν) which can be characterized by a symmetric, positive semidefinite kernel, and each of its functions satisfies the reproducing property. Here we discuss vector-valued reproducing kernel Hilbert spaces [23], which generalize real-valued reproducing kernel Hilbert spaces [2].
Definition 2.1 (Vector-valued reproducing kernel Hilbert space). For a non-empty set X and a real separable Hilbert space $(Y, \langle \cdot, \cdot \rangle_Y)$, a Hilbert space H of functions from X to Y is said to be a vector-valued reproducing kernel Hilbert space if the linear functional $F_{x,y} : H \to \mathbb{R}$, defined by
$F_{x,y}(f) = \langle y, f(x) \rangle_Y, \quad f \in H,$
is continuous for every x ∈ X and y ∈ Y.
Throughout the paper, $T^*$ denotes the adjoint of an operator T.
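For a given operator-valued positive semi-definite kernel K : X × X → L(Y), we can construct a unique vector-valued reproducing kernel Hilbert space $(H, \langle \cdot, \cdot \rangle_H)$ of functions from X to Y as follows:
(i) We define the linear function $K_x : Y \to H : y \mapsto K_x y$, where $(K_x y)(x') = K(x', x) y$ for x, x′ ∈ X and y ∈ Y.
(ii) The span of the set $\{K_x y : x \in X, y \in Y\}$ is dense in H.
(iii) Reproducing property: $\langle f(x), y \rangle_Y = \langle f, K_x y \rangle_H$ for x ∈ X, y ∈ Y and f ∈ H, in other words $f(x) = K_x^* f$.
Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces [23]. The reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space in the case that Y is a bounded subset of $\mathbb{R}$, and the corresponding kernel becomes a symmetric, positive semi-definite function K : X × X → $\mathbb{R}$.
We make the following assumption concerning the Hilbert space H′:
Assumption 1. The space H′ is assumed to be a vector-valued reproducing kernel Hilbert space of functions f : X → Y corresponding to the kernel K : X × X → L(Y) such that
(i) $K_x : Y \to H'$ is a Hilbert-Schmidt operator for all x ∈ X with $\kappa^2 := \sup_{x \in X} \|K_x\|_{HS}^2 = \sup_{x \in X} \operatorname{tr}(K_x^* K_x) < \infty$;
(ii) the real-valued function $(x, t) \mapsto \langle K_t u, K_x v \rangle_{H'}$ is measurable for all u, v ∈ Y.
Note that in the case of real-valued functions (Y ⊂ $\mathbb{R}$), Assumption 1 simplifies to the condition that the kernel is measurable and $\kappa^2 := \sup_{x \in X} K(x, x) < \infty$.
Now we introduce some relevant operators used in the convergence analysis. We introduce the notations for the discrete ordered sets $\mathbf{x} = (x_1, \ldots, x_m)$, $\mathbf{y} = (y_1, \ldots, y_m)$, $\mathbf{z} = (z_1, \ldots, z_m)$. The product Hilbert space $Y^m$ is equipped with the inner product $\langle \mathbf{y}, \mathbf{y}' \rangle_m = \frac{1}{m} \sum_{i=1}^m \langle y_i, y_i' \rangle_Y$ and the corresponding norm $\|\mathbf{y}\|_m^2 = \frac{1}{m} \sum_{i=1}^m \|y_i\|_Y^2$. We define the sampling operator $S_{\mathbf{x}} : H' \to Y^m : g \mapsto (g(x_1), \ldots, g(x_m))$; then the adjoint $S_{\mathbf{x}}^* : Y^m \to H'$ is given by
$S_{\mathbf{x}}^* \mathbf{c} = \frac{1}{m} \sum_{i=1}^m K_{x_i} c_i, \quad \mathbf{c} = (c_1, \ldots, c_m) \in Y^m.$
Let $I_\nu$ be the canonical injection map from H′ to $L^2(X, \nu; Y)$. Then both the canonical injection map $I_\nu$ and the sampling operator $S_{\mathbf{x}}$ are bounded by κ under Assumption 1, since
$\|I_\nu f\|_{L^2(X,\nu;Y)}^2 = \int_X \|K_x^* f\|_Y^2 \, d\nu(x) \le \kappa^2 \|f\|_{H'}^2$
and
$\|S_{\mathbf{x}} f\|_m^2 = \frac{1}{m} \sum_{i=1}^m \|K_{x_i}^* f\|_Y^2 \le \kappa^2 \|f\|_{H'}^2.$
We denote by $T_\nu = I_\nu^* I_\nu : H' \to H'$ the population version of the corresponding covariance operator. The operator $T_\nu$ is positive, self-adjoint and depends on both the kernel and the marginal probability measure ν. We also introduce the sampling version $T_{\mathbf{x}} = S_{\mathbf{x}}^* S_{\mathbf{x}}$, which is positive, self-adjoint and depends on both the kernel and the inputs $\mathbf{x}$.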
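The boundedness of the sampling operator by κ can be checked numerically in the scalar case. The sketch below is an illustration under assumptions not made in the text: a real-valued Gaussian kernel on [0, 1] (for which K(x, x) = 1, hence κ = 1), a function f given as a finite kernel expansion, and randomly chosen centers, coefficients and inputs.

```python
import math
import random

def kernel(x, t, width=0.2):
    """Gaussian kernel on [0, 1]; K(x, x) = 1, so kappa = 1 (illustrative choice)."""
    return math.exp(-((x - t) ** 2) / (2.0 * width ** 2))

rng = random.Random(0)
centers = [rng.random() for _ in range(5)]          # f = sum_j c_j K(., u_j)
coeffs = [rng.uniform(-1.0, 1.0) for _ in centers]
xs = [rng.random() for _ in range(50)]              # inputs x_1, ..., x_m

def f(x):
    """Evaluate the kernel expansion f at the point x."""
    return sum(c * kernel(x, u) for c, u in zip(coeffs, centers))

# RKHS norm: ||f||^2 = sum_{j,k} c_j c_k K(u_j, u_k)
rkhs_norm_sq = sum(cj * ck * kernel(uj, uk)
                   for cj, uj in zip(coeffs, centers)
                   for ck, uk in zip(coeffs, centers))

# Empirical norm: ||S_x f||_m^2 = (1/m) sum_i f(x_i)^2
sampled_norm_sq = sum(f(x) ** 2 for x in xs) / len(xs)

# kappa^2 = sup_x K(x, x), which equals 1 for this kernel
kappa_sq = max(kernel(x, x) for x in xs)
```

The inequality `sampled_norm_sq <= kappa_sq * rkhs_norm_sq` follows from the reproducing property and the Cauchy-Schwarz inequality, so it holds for any choice of inputs, not just the random ones drawn here.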
By spectral theory, the operator $L^s : D(L^s) \to H$ is well-defined for $s \in \mathbb{R}$, and the spaces $H_s := D(L^s)$, s ≥ 0, equipped with the inner product $\langle f, g \rangle_{H_s} = \langle L^s f, L^s g \rangle_H$, $f, g \in H_s$, are Hilbert spaces. For s < 0, the space $H_s$ is defined as the completion of H under the norm $\|f\|_s := \langle f, f \rangle_s^{1/2}$. The family $(H_s)_{s \in \mathbb{R}}$ is called the Hilbert scale induced by L. We notice that the space $H_0$ is H according to the above notation. The interpolation inequality is an important tool for the analysis:
(6) $\|f\|_{H_r} \le \|f\|_{H_t}^{\frac{s-r}{s-t}} \, \|f\|_{H_s}^{\frac{r-t}{s-t}}, \quad f \in H_s,$
which holds for any t < r < s.

2.2. The true solution, noise condition, and nonlinearity structure. We consider that the random observations $\{(x_i, y_i)\}_{i=1}^m$ follow the model y = A(f)(x) + ε with a centered noise ε. We assume throughout the paper that the operator A is injective.
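Assumption 2 (The true solution). The conditional expectation w.r.t. ρ of y given x exists (a.s.), and there exists $f_\rho \in D(A) \cap D(L)$ such that
$\int_Y y \, d\rho(y|x) = g_\rho(x) = A(f_\rho)(x)$ for almost all x ∈ X.
From (3) we observe that $f_\rho$ is the minimizer of the expected risk. The element $f_\rho$ is the true solution which we aim at estimating.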

Assumption 3 (Noise condition). There exist some constants M, Σ such that, for almost all x ∈ X, the conditional distribution of the noise $y - A(f_\rho)(x)$ satisfies a Bernstein-type condition with parameters M and Σ.
This assumption is usually referred to as a Bernstein-type assumption. The distribution of the observational noise is reflected in the parameters M > 0, Σ > 0. For the convergence analysis, the output space need not be bounded as long as the noise condition for the output variable is fulfilled.
We need an assumption on the nonlinearity structure of the operator A to establish the rates of convergence. Following the work of Engl et al. [13, Chapt. 10], [17] on 'classical' nonlinear inverse problems, we consider the following assumption:
Assumption 4 (Nonlinearity structure).
(i) D(A) is convex, A : D(A) ∩ D(L) → H′ is weakly sequentially closed and A is Fréchet differentiable with derivative A′ : D(A) → L(H, H′).
(ii) The Fréchet derivative A′(f) is bounded in a ball of sufficiently large radius d, i.e., there exists J < ∞ such that $\|A'(f)\|_{H \to H'} \le J$ for all f in this ball.
(iii) (Link condition) There exist constants p > 0 and α, β > 0 such that, for all g ∈ H, the norm of $A'(f_\rho) g$ is bounded from below by $\alpha \|g\|_{H_{-p}}$ and from above by $\beta \|g\|_{H_{-p}}$.
(iv) (Lipschitz continuity of A′) For all f ∈ D(A) ∩ D(L), there exists a constant γ such that $\|A'(f_\rho) - A'(f)\|_{H \to H'} \le \gamma \|f_\rho - f\|_H$.
A sufficient condition for weak sequential closedness is that D(A) is weakly closed (e.g. closed and convex) and A is weakly continuous. The link condition (Assumption 4 (iii)) is an interplay between the operator $L^{-1}$ and the Fréchet derivative of the operator A. This link condition is known as finitely smoothing. This condition is satisfied in various types of problems (for examples see [9, Example 10.2], [32, Examples 4, 5]).

2.3. Effective dimension. Now we introduce the concept of the effective dimension, which is an important ingredient to derive the rates of convergence [7,10,14,19,21,28]. The effective dimension is defined as
$\mathcal{N}(\lambda) := \operatorname{Tr}\left( (T_\nu + \lambda I)^{-1} T_\nu \right), \quad \lambda > 0.$
Using the spectral decomposition $T_\nu = \sum_{i=1}^\infty t_i \langle \cdot, e_i \rangle_{H'} e_i$ for an orthonormal sequence $(e_i)_{i \in \mathbb{N}}$ of eigenvectors of $T_\nu$ with corresponding eigenvalues $(t_i)_{i \in \mathbb{N}}$ such that $t_1 \ge t_2 \ge \ldots \ge 0$, we get
$\mathcal{N}(\lambda) = \sum_{i=1}^\infty \frac{t_i}{\lambda + t_i}.$
Hence the function λ ↦ $\mathcal{N}(\lambda)$ is continuous and decreasing from ∞ to zero for 0 < λ < ∞ for the infinite-dimensional operator $T_\nu$ (for details see [5,8,18,21,33]). Since the integral operator $T_\nu$ is a trace class operator, the effective dimension is finite and we have that
(7) $\mathcal{N}(\lambda) \le \|(T_\nu + \lambda I)^{-1}\| \operatorname{Tr}(T_\nu) \le \frac{\kappa^2}{\lambda}.$
Assumption 5 (Polynomial decay condition). Assume that there exists some positive constant c > 0 such that $\mathcal{N}(\lambda) \le c \lambda^{-b}$ for some b < 1 and all λ > 0.
Assumption 6 (Logarithmic decay condition). Assume that there exists some positive constant c > 0 such that $\mathcal{N}(\lambda) \le c \log(1/\lambda)$ for all λ > 0.
Lu et al. [21] showed that different kernels with some probability measures show different behavior of the effective dimension. For the kernel $K_1(x, x') = x x' + e^{-8(x - x')^2}$ with uniform sampling on [0, 1], the effective dimension exhibits log-type behavior (Assumption 6); on the other hand, the kernel $K_2(x, x') = \min\{x, x'\} - x x'$ exhibits power-type behavior (Assumption 5).
Caponnetto et al. [10] showed that if the eigenvalues $t_n$ of the integral operator $T_\nu$ follow a polynomial decay, i.e., $t_n$ is of order $n^{-1/b}$ for fixed positive constants µ and b < 1, then the effective dimension behaves like a power-type function (Assumption 5).
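The link between polynomial eigenvalue decay and power-type behavior of the effective dimension can be checked numerically. The sketch below assumes the standard spectral form of the effective dimension, the sum of t_i / (t_i + λ), and the illustrative spectrum t_n = n^(-1/b) with b = 0.5, truncated to finitely many terms. The constant `c` comes from comparing the sum with the integral of 1 / (1 + u^(1/b)) over (0, ∞), which equals πb / sin(πb); these choices are assumptions made for the example, not constants from the paper.

```python
import math

def effective_dimension(eigenvalues, lam):
    """N(lam) = sum_i t_i / (t_i + lam) for eigenvalues t_i of the covariance operator."""
    return sum(t / (t + lam) for t in eigenvalues)

b = 0.5
eigs = [n ** (-1.0 / b) for n in range(1, 20001)]   # t_n = n^(-1/b), truncated spectrum
c = math.pi * b / math.sin(math.pi * b)             # constant from the integral comparison

lams = [10.0 ** (-k) for k in range(1, 6)]          # lam = 1e-1, ..., 1e-5
dims = [effective_dimension(eigs, lam) for lam in lams]

for lam, dim in zip(lams, dims):
    # Power-type bound N(lam) <= c * lam^(-b) and trace bound N(lam) <= Tr(T) / lam
    print(f"lam={lam:.0e}  N(lam)={dim:9.3f}  c*lam^-b={c * lam ** (-b):9.3f}")
```

For small λ the power-type bound is much sharper than the trace bound `sum(eigs) / lam`, which is why the effective dimension, rather than the trace alone, governs the convergence rates.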

Convergence analysis
Here we establish error bounds in the H-norm, in the probabilistic sense, for Tikhonov regularization of nonlinear statistical inverse problems. An explicit expression for $f_{z,\lambda}$ is not known; therefore we use the definition (5) of the Tikhonov estimator $f_{z,\lambda}$ to derive the error estimates. A linearization technique is used for the nonlinear operator A in a neighborhood of the true solution $f_\rho$. The rates of convergence are established by exploiting the nonlinearity structure of the operator A (see Assumption 4). We discuss the rates of convergence for the Tikhonov estimator by measuring the effect of random sampling, which is governed by the noise condition (Assumption 3). The bounds on the reconstruction error depend on the effective dimension, the smoothness parameter q of the true solution and the parameter p related to the link condition.
It is convenient to introduce shorthand notation for some key "standardized" quantities used in our analysis. We let and .
The error bound discussed in the following theorem holds non-asymptotically, provided that the regularization parameter λ and the sample size m are chosen such that the following holds: The condition (8) says that as the regularization parameter λ decreases, the sample size m must increase.
Theorem 3.1. Let z be i.i.d. samples drawn according to the probability measure ρ. Suppose that Assumptions 1-4 and (8) hold true and that $f_\rho - \bar f \in H_q$ for some q ∈ [1, 2 + p]. Then, for the Tikhonov estimator $f_{z,\lambda}$ in (5) with the a-priori choice of the regularization parameter λ = Θ −1 N ,p,q , for all 0 < η < 1, the following error bound holds with confidence 1 − η: .
Proof. By the definition of $f_{z,\lambda}$ as the solution of the minimization problem in (5), we have
(9) $\frac{1}{m} \sum_{i=1}^m \|A(f_{z,\lambda})(x_i) - y_i\|_Y^2 + \lambda \|L(f_{z,\lambda} - \bar f)\|_H^2 \le \frac{1}{m} \sum_{i=1}^m \|A(f_\rho)(x_i) - y_i\|_Y^2 + \lambda \|L(f_\rho - \bar f)\|_H^2.$
By linearizing the nonlinear operator A at the true solution $f_\rho$ we get
(10) $A(f_{z,\lambda}) = A(f_\rho) + A'(f_\rho)(f_{z,\lambda} - f_\rho) + r(f_{z,\lambda}),$
where $r(f_{z,\lambda})$ is the error term of the linearization. Using this we re-express the inequality (9) as follows, Then we have, which implies Now with Assumption 4 and (4), from Lemmas A.3-A.4 we obtain, where Under the condition (iii) of Assumption 4, using the interpolation inequality (6), we obtain H which can be re-expressed as =O In the analysis, we will make repeated use of the inequality (14). We apply this inequality to the estimate (13) for $c = \|f_{z,\lambda} - f_\rho\|_{H_{-p}}$ and r = 2. First we take t = 1, d = H and we obtain H and e = δ 1 + δ H , where $\delta_4^2 = \delta_1 + \delta_3^2$. Replacing the term that contains $\|f_{z,\lambda} - f_\rho\|_{H_{-p}}$ on the right-hand side in (12) and using the inequality $(x + y)^r \le x^r + y^r$ for 0 ≤ r ≤ 1 we obtain , Under the condition (8), from Propositions A.1-A.2 we get with probability 1 − η, Under the condition (8) the spectral decomposition of the operator $T_\nu$ gives (18) From (8) we get Hence we get,

Discussion
We discussed a finite sample bound for the Tikhonov regularization scheme for nonlinear statistical inverse problems in the vector-valued setting; therefore the results can be applied to the multitask learning problem. The convergence rates presented in Section 3 hold asymptotically, i.e., all parameters are fixed as m → ∞. The considered framework covers previously proposed settings for different learning schemes: direct and linear inverse learning problems.
The rates of convergence were expressed in terms of the effective dimension N(λ) of the governing operator $T_\nu$, which can be seen from the basic probabilistic bound given in Proposition A.1. Also, Corollaries 3.3 and 3.4 give a handy representation of the error bounds under different behavior of the effective dimension, corresponding to the ill-posedness of the problem. It is well known that standard Tikhonov regularization suffers from the saturation effect. We observe from the analysis in Section 3 that the saturation effect can be overcome by using Tikhonov regularization in Hilbert scales.
The a-priori parameter choice considered in our analysis requires knowledge of the parameters b, p, q, which are typically unknown in practice. In practice, an a posteriori (data-dependent) parameter choice rule for the regularization parameter λ, such as the Lepskii balancing principle, the discrepancy principle, or the quasi-optimality principle, needs to be considered with theoretical justification, so that we can turn our results into data-dependent minimax adaptivity without a priori knowledge of the regularity parameters.

Appendix A. Probabilistic estimates and preliminary results
The following bounds are standard in learning theory; they estimate the effect of random sampling in the probabilistic sense using Assumption 3. The following propositions can be proved similarly to the arguments given in [10, Theorem 4].
Proposition A.1. Suppose Assumptions 1-3 hold true. Then for $m \in \mathbb{N}$ and 0 < η < 1, each of the following estimates holds with confidence 1 − η, In the following proposition, the probabilistic estimate of the first term can be established under the condition (8) Proof. Under Assumption 4 we obtain and
Lemma A.4. For the error term in eq. (10), under Assumption 4 we have:
Proof. From Lemma A.3 and (4), under Assumption 4 we have, and using the inequality
$\langle f, g \rangle_{H'} = \lambda \langle f, (T_\nu + \lambda I)^{-1} g \rangle_{H'} + \langle f, T_\nu (T_\nu + \lambda I)^{-1} g \rangle_{H'} \le \left( \sqrt{\lambda} \, \|f\|_{H'} + \|I_\nu f\|_{L^2(X,\nu;Y)} \right) \|(T_\nu + \lambda I)^{-1/2} g\|_{H'}$
we obtain,

(14) $c^r \le e + d c^t \;\Rightarrow\; c^r = O\left( e + d^{\frac{r}{r-t}} \right),$ which holds for 0 ≤ t < r and c, d, e > 0.
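The implication (14) can be sanity-checked numerically for the case r = 2, t = 1 used in the proof. A short case distinction (either d c^t ≤ e or d c^t > e) gives the explicit bound c^r ≤ 2e + (2d)^(r/(r−t)), so for r = 2, t = 1 the ratio c² / (e + d²) is at most 4. The sketch below, with randomly drawn d and e over several orders of magnitude (an arbitrary illustrative choice), evaluates this ratio at the largest admissible c.

```python
import math
import random

def largest_c(d, e):
    """Largest c > 0 with c^2 <= e + d*c: the positive root of c^2 - d*c - e = 0."""
    return 0.5 * (d + math.sqrt(d * d + 4.0 * e))

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(10000):
    d = 10.0 ** rng.uniform(-3, 3)
    e = 10.0 ** rng.uniform(-3, 3)
    c = largest_c(d, e)
    # r = 2, t = 1, so r / (r - t) = 2 and the claimed bound is O(e + d^2)
    worst_ratio = max(worst_ratio, c * c / (e + d * d))
```

Since every smaller c satisfying the hypothesis has a smaller ratio, checking the largest admissible c covers the worst case for each pair (d, e).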