Total generalized variation regularization in data assimilation for Burgers' equation

We propose a second-order total generalized variation (TGV) regularization for the reconstruction of the initial condition in variational data assimilation problems. After showing the equivalence between TGV regularization and a Bayesian MAP estimator, we focus on a detailed study of the inviscid Burgers' data assimilation problem. Due to the difficult structure of the governing hyperbolic conservation law, we consider a discretize-then-optimize approach and rigorously derive a first-order optimality condition for the problem. For the numerical solution, we propose a globalized reduced Newton-type method together with a polynomial line-search strategy, and prove convergence of the algorithm to stationary points. The paper finishes with numerical experiments in which, among other aspects, the performance of TGV regularization is compared with that of TV regularization.


1. Introduction. Data assimilation can be described as the process in which one aims to find an estimate of the state of a system using observations and background information, in order to obtain improved numerical forecasts of the system. The resulting inverse problems are typically ill-posed, since one seeks a solution using only partial observations of the state. As a remedy, regularization approaches such as filtering, a posteriori regularization, variational regularization, etc., are usually considered [17,18,21].
In this paper, we focus our attention on the variational data assimilation framework, whose main idea consists in solving an optimization problem that fits the observations, on the one hand, and uses background information of previous forecasts, on the other. The background term usually plays the role of a regularizer, if one compares with an inverse problems methodology. Classically, variational data assimilation problems arise from a robust Bayesian estimation of the initial condition [14,17]. If the observations are taken at one sole time instant, one obtains a so-called 3D-Var problem, while if a whole time window is considered, the so-called 4D-Var problem is obtained.
A generic mathematical formulation of the 4D-Var data assimilation problem consists in solving the least squares problem

(1) min_x (1/2)(x − x_b)^T B^{-1}(x − x_b) + (1/2) Σ_{i=1}^{N} (H_i(x_i) − z_i)^T R_i^{-1}(H_i(x_i) − z_i),

subject to x_i = M_i(x_{i-1}), i = 1, …, N,

where N is the number of time instants at which the observations are taken, each vector z_i consists of the observations spread over the spatial mesh at the i-th time instant, and x_b denotes the background state. Furthermore, B is the covariance matrix of the background error, R_i the covariance matrix of the observational error, H_i(·) the observation operator that maps the model state to the observation space at the i-th time instant, and M_i(·) the nonlinear state operator for each instant.
Although 4D-Var has been successfully tested and proved to be of operational utility in Numerical Weather Prediction (NWP), there are some shortcomings of the standard approach (1). These are related to the fact that the system state may present discontinuities, which are poorly reconstructed with the standard methodology. In fact, discontinuities in the form of so-called sharp fronts, are of key importance in Numerical Weather Prediction. With this in mind, alternative regularizations such as Total Variation (TV) have been previously considered in [5,11], where a comparison with respect to more standard regularizations has been carried out with promising results.
Total variation, however, has its downside too, since it performs well mainly when the reconstructed solution has a piecewise constant structure, something that cannot, in general, be expected in, e.g., fluid dynamics. To overcome the piecewise constant reconstruction effect, higher-order regularizers have been considered in recent years, with the total generalized variation (TGV) as one of their main representatives [4,3,15,2]. The second-order total generalized variation regularization depends on two positive parameters α > 0 and β > 0 and has the following structure:

(2) R(u) = min_{w ∈ R^{n-1}} α ‖Du − w‖_1 + β ‖Ew‖_1,

where the matrix D corresponds to the discrete gradient and E to the symmetrized discrete gradient.
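To make the structure of (2) concrete, the following Python sketch evaluates a smoothed version of the discrete second-order TGV functional by minimizing over the auxiliary variable w numerically. The smooth surrogate sqrt(a² + ε) for |a| and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def diff_matrix(n, h=1.0):
    """Forward-difference matrix D in R^{(n-1) x n}: (Du)_i = (u_{i+1} - u_i) / h."""
    return (np.eye(n - 1, n, k=1) - np.eye(n - 1, n)) / h

def tgv2(u, alpha, beta, eps=1e-8):
    """Approximate TGV^2(u) = min_w alpha * ||Du - w||_1 + beta * ||Ew||_1.

    The l1 norms are smoothed by sqrt(a^2 + eps) so that a quasi-Newton
    method can be used for the inner minimization over w."""
    n = u.size
    D = diff_matrix(n)        # discrete gradient
    E = diff_matrix(n - 1)    # in 1D the symmetrized gradient is the usual one
    phi = lambda a: np.sqrt(a ** 2 + eps).sum()
    obj = lambda w: alpha * phi(D @ u - w) + beta * phi(E @ w)
    res = minimize(obj, x0=D @ u, method="BFGS")
    return res.fun

x = np.linspace(0.0, 1.0, 30)
ramp = 2.0 * x                    # affine: w = Du gives Ew = 0, so TGV is near zero
step = (x > 0.5).astype(float)    # jump: TGV stays positive
print(tgv2(ramp, 1.0, 1.0), tgv2(step, 1.0, 1.0))
```

The affine profile is the key difference from total variation: TV penalizes the ramp, while TGV (with w absorbing the constant slope) does not, which is exactly the behavior exploited later in the numerical experiments.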
In this paper, we investigate the use of second-order total generalized variation regularization for the solution of data assimilation problems constrained by the inviscid Burgers' equation. This equation is a first model for turbulence and a standard toy problem for the dynamics of the atmosphere. Due to its hyperbolic conservation law structure, we adopt a discretize-then-optimize approach for the solution of the problem, to avoid complications related to the well-posedness of the equation in function spaces [19]. A similar approach has been previously pursued in [1].
Specifically, we consider the 4D-Var problem augmented with the regularization term R(·) given in (2),

(3) min_u (1/2)(u − u_b)^T B^{-1}(u − u_b) + (1/2) Σ_{i=1}^{N} (H_i(y_i) − z_i)^T R_i^{-1}(H_i(y_i) − z_i) + R(u),

subject to the discretized Burgers' equation. For problem (3) we analyze the well-posedness of the discretized equation, the existence of optimal solutions, the well-posedness of the adjoint equation and the characterization of local minima by means of a first-order optimality system of Karush-Kuhn-Tucker type. The most challenging aspect concerns the numerical solution of the problem. In order to devise a fast, superlinearly convergent algorithm, we consider a Huber regularization of the total generalized variation and start by verifying the consistency of the approximation. Each regularized problem is solved by means of a reduced Newton method based on a modified primal-dual formulation of the optimality system. The modification concerns the second-order iteration matrix, which in its original form does not guarantee descent in each iteration. The global convergence of the resulting algorithm is demonstrated.
The organization of the paper is as follows. Section 2 is devoted to discussing the equivalence of the TGV regularized problem with a Bayesian MAP estimator. In Section 3 we analyze the discretization of the Burgers' equation as well as the corresponding data assimilation problem, obtaining a necessary first-order optimality condition for the optimization problem. In Section 4 a Huber regularized version of the data assimilation problem is introduced and the consistency of the regularization is proved. The next section is devoted to the presentation of a globalized Newton-type method involving dual variables and projections, while its convergence study is carried out in Section 6. In the last section, we present numerical experiments that reveal the improvement in the recovery of the solutions when the TGV regularization is used, and test the performance of the proposed algorithm.
2. Maximum A Posteriori estimation. This section aims to show that problem (3) has a statistical interpretation and can be derived using a Bayesian approach to obtain the Maximum A Posteriori (MAP) estimator. With this purpose, let us start by writing the TGV regularized data assimilation problem in generic form, with state x, observation operator H(·), discrete gradient D and the covariance matrices B and R introduced above. As mentioned in the introduction, we expect the solutions of the data assimilation problem to exhibit discontinuities. This is in fact what we know a priori about the solution, and we use it for the MAP estimation. The next lines follow the ideas in [16].
In the Bayesian method, unlike in maximum likelihood estimation, we assume that the unknown is a random variable whose prior probability density function p(x) is known, together with the conditional probability density function (c.p.d.f.) of the observations z given x, denoted by p(z|x). The basic idea of this approach is to combine the information from both probability density functions using Bayes' formula in order to obtain an a posteriori probability. Here it is assumed that the observations and the background information have normally distributed errors with covariance matrices R and B, respectively. Therefore, the c.p.d.f. is given by the Gaussian distribution

p(z|x) = (2π)^{-m_o/2} |R|^{-1/2} exp( -(1/2)(H(x) − z)^T R^{-1}(H(x) − z) ),

where m_o is the total number of observations.

Now, we need to look for a prior distribution that captures the expected main features of the solutions. Since these solutions are expected to be discontinuous, previous works suggest using the Laplace distribution [5,16]. Specifically, it is assumed that the solution is a realization of a random process defined by a Laplace prior probability density function for Dx, with parameters θ = 1 and µ = 0, i.e.,

p(x) ∝ exp( -‖Dx‖_1 ).

Using Bayes' formula, the a posteriori probability of x given the observations z is

p(x|z) = p(z|x) p(x) / p(z).

Since we aim to find the best Bayesian estimator using the maximum a posteriori approach, we maximize this a posteriori probability. The marginal distribution p(z) does not depend on x, so we optimize

(5) max_x p(z|x) p(x).

Since the logarithmic function is monotonically increasing and continuous, solving (5) is equivalent to maximizing the natural logarithm of the objective function, that is,

max_x ln p(z|x) + ln p(x).
Replacing the definitions of the probability density functions mentioned above yields precisely the objective of the TGV regularized problem and, consequently, the equivalence of the 4D-Var problem with TGV regularization with a MAP estimation.
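The derivation above can be checked numerically: with a Gaussian likelihood and a Laplace prior on Dx, the negative log-posterior differs from the quadratic-misfit-plus-ℓ1 objective only by an additive constant, so both have the same minimizers. The sketch below uses small random stand-in matrices (none of them from the paper) to verify the constant gap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_o = 6, 4
H = rng.standard_normal((m_o, n))             # stand-in observation operator
R = np.eye(m_o)                               # observation error covariance
z = rng.standard_normal(m_o)                  # stand-in observations
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)  # discrete gradient

def neglog_posterior(x):
    """-ln p(z|x) - ln p(x): Gaussian likelihood, Laplace prior on Dx."""
    r = H @ x - z
    gauss = (0.5 * m_o * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(R))
             + 0.5 * r @ np.linalg.solve(R, r))
    laplace = (n - 1) * np.log(2.0) + np.abs(D @ x).sum()  # theta = 1, mu = 0
    return gauss + laplace

def objective(x):
    """Variational objective: quadratic misfit plus l1 regularizer."""
    r = H @ x - z
    return 0.5 * r @ np.linalg.solve(R, r) + np.abs(D @ x).sum()

xs = [rng.standard_normal(n) for _ in range(5)]
gaps = [neglog_posterior(x) - objective(x) for x in xs]
print(max(gaps) - min(gaps))  # the gap is the same for every x
```

Since the gap is constant in x, minimizing the variational objective and maximizing the posterior are equivalent, which is the content of the MAP argument.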
3. Analysis of the data assimilation problem. This section is devoted to the analysis of the data assimilation problem for the Burgers' equation. We start with the discretization of the state system, using a finite difference scheme in space and a semi-implicit method in time. The existence of solutions to the fully discretized equation is then justified using the implicit function theorem. The same technique is applied to justify the existence of an adjoint state and to derive a first-order optimality condition of Karush-Kuhn-Tucker type. This approach is known as discretize-then-optimize.
In order to solve the latter, both a spatial and a temporal discretization scheme have to be chosen. For the spatial discretization several methodologies, like finite differences, finite elements or finite volumes, have classically been considered. Here we focus on a finite difference scheme on a homogeneous partition of the domain Ω = (0, 1), whose discretization points are given by x_i = ih, i = 1, …, n, with h = 1/(n+1) and n the number of spatial discretization points. For the approximation of the first spatial derivative of y_i we use a standard backward finite difference scheme, with the corresponding matrix denoted by U. For the time discretization, in order to guarantee convergence towards a (possibly) discontinuous solution, we focus on semi-implicit schemes. Moreover, to avoid solving the nonlinearity in each time step, we consider the semi-implicit Euler scheme

(y^{j+1} − y^j)/∆t + diag(y^j) U y^{j+1} = g^{j+1},

where ∆t = 1/(N_t + 1) and N_t is the number of time discretization points. With this kind of discretization, the problem reduces to the solution of a linear system at each time step, with the following structure:

(6) y^{j+1} + ∆t diag(y^j) U y^{j+1} = ∆t g^{j+1} + y^j.
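A minimal implementation of the time-stepping in (6) can be sketched as follows. The choice g ≡ 0, the homogeneous Dirichlet value at the left boundary and all names are illustrative assumptions; a positive initial condition makes the sign condition of Assumption 1 (below) hold.

```python
import numpy as np

def burgers_semi_implicit(u0, T=0.5, Nt=100):
    """Semi-implicit Euler scheme (6) for y_t + y y_x = 0 with g = 0:
    solve (I + dt * diag(y_j) U) y_{j+1} = y_j at each time step.

    u0 : initial condition on the interior grid points (y = 0 at x = 0)."""
    n = u0.size
    h = 1.0 / (n + 1)
    dt = T / Nt
    # backward-difference matrix: (U y)_i = (y_i - y_{i-1}) / h, with y_0 = 0
    U = (np.eye(n) - np.eye(n, k=-1)) / h
    y = u0.copy()
    for _ in range(Nt):
        A = np.eye(n) + dt * np.diag(y) @ U   # linear system per time step
        y = np.linalg.solve(A, y)
    return y

x = np.linspace(0.0, 1.0, 52)[1:-1]   # interior grid points
u0 = np.sin(np.pi * x) + 0.5          # positive data: sign condition holds
yT = burgers_semi_implicit(u0)
print(yT.max(), u0.max())
```

For non-negative data the scheme is an implicit upwind discretization, so each new value is a convex combination of old values and the maximum cannot grow, mirroring the stability estimates proved below.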
Considering both the spatial and the time discretization, we can rewrite the matrix y ∈ R^{n×N_t} as a vector y = (y^1, …, y^{N_t})^T with y^j ∈ R^n, thus y ∈ R^m with m = n · N_t. In this context, we define block matrices E, Z(y) and U acting on the whole space-time vector. The system that we have to solve is then given by

(8) e(y, u) := Ey + Z(y)Uy − g(u) = 0, with g(u) = (u/∆t, g^2, …, g^{N_t})^T.

In order to prove the local existence of solutions of the state equation we will use the implicit function theorem (see, e.g., [7]). For this purpose, we consider hereafter the following assumption.
Assumption 1. For each j = 1, …, N_t, it holds that y^j_i U_{ii} ≥ 0, for all i = 1, …, n.

Remark 1. Let us notice that Assumption 1 is of importance not only for the analysis, but also for the numerical solution of both the state equation and the data assimilation problem. This hypothesis may be constructively guaranteed by using an upwind scheme.
Computing the partial derivative of the state equation operator e with respect to the state, in the direction ϕ, we can write the linearized equation, for a given right-hand side φ ∈ R^m, in the short form

(11) [E + ZU + diag(Uy)] ϕ = φ.
Lemma 1. Let Assumption 1 hold. Then the matrix E + ZU + diag(Uy) in (11) is nonsingular, i.e., the linearized equation admits a unique solution ϕ for every φ ∈ R^m.

Proof. From equation (10a), ϕ^1 is uniquely defined. Given ϕ^j, for some j ≥ 1, ϕ^{j+1} solves a linear system with matrix (1/∆t) I + diag(y^j) U. Since by assumption y^j_i U_{ii} ≥ 0, for all i = 1, …, n, this matrix is strictly diagonally dominant and, therefore, nonsingular. Hence, ϕ^{j+1} is uniquely determined. Proceeding recursively, the result is obtained.
Theorem 3.1. Let Assumption 1 hold. Let e(y, u) be the discretized Burgers' equation operator given by (9) and (ȳ, ū) such that e(ȳ, ū) = 0. Then there exists a neighborhood U of ū and a continuously differentiable implicit function y(u) such that e(y(u), u) = 0, for all u ∈ U.
Proof. Due to the definitions of the matrices E, Z, U we know that e(y, u) is continuously differentiable and its derivatives are given by e_y(y, u) = E + ZU + diag(Uy) and e_u(y, u) = [−(1/∆t) I, 0, 0, …, 0]^T. Let (ȳ, ū) be such that e(ȳ, ū) = 0. From Lemma 1 we know that e_y(ȳ, ū) is invertible and, from the implicit function theorem, there exists an open neighborhood U of ū and a continuously differentiable function y(u) such that e(y(u), u) = 0, for all u ∈ U.
In the next lemma we prove the boundedness of y with respect to the data.

Lemma 3.2. Let Assumption 1 hold and let y be the solution of the state equation e(y, u) = 0. The following inequality holds:

(13) ‖y^k‖ ≤ ‖u‖ + ∆t Σ_{j=2}^{k} ‖g^j‖, for all k = 1, …, N_t.

Proof. We start by recalling the state equation e(y, u) = Ey + Z(y)Uy − g(u) = 0, where y = (y^1, …, y^{N_t})^T and g(u) = (u/∆t, g^2, …, g^{N_t})^T. Performing a component-wise analysis in time, we distinguish two cases. For j = 1, the first block equation yields ‖y^1‖ ≤ ‖u‖. For j = 2, …, N_t, the j-th block equation reads

[I + ∆t diag(y^{j−1}) U] y^j = y^{j−1} + ∆t g^j.

Multiplying the previous equality with (y^j)^T and using the hypothesis y^j_i U_{ii} ≥ 0, for each j = 1, …, N_t − 1 and all i = 1, …, n, we get that

‖y^j‖² ≤ (y^j)^T [I + ∆t diag(y^{j−1}) U] y^j ≤ ‖y^{j−1}‖ ‖y^j‖ + ∆t ‖g^j‖ ‖y^j‖,

which implies that ‖y^j‖ ≤ ‖y^{j−1}‖ + ∆t ‖g^j‖, for all j = 1, …, N_t.
Using the discrete Gronwall inequality we finally obtain ‖y^k‖ ≤ ‖u‖ + ∆t Σ_{j=2}^{k} ‖g^j‖, for all k = 1, …, N_t.

3.2. Discretized data assimilation problem. For the discretization of problem (3), we assume that the observation operator H(·) is linear and, therefore, there exist matrices S ∈ R^{m_o×(n_o N_t)} and H ∈ R^{n_o N_t × m} such that H(x) = SHx, with m = nN_t, n_o the number of observations taken in space, N_o the number of instants at which observations are taken and m_o = n_o N_o. Thus, the problem is given by:

(14) min_{u,w} (1/2)(u − u_b)^T B^{-1}(u − u_b) + (1/2)(SHy − z)^T R^{-1}(SHy − z) + α ‖Du − w‖_1 + β ‖Ew‖_1 + (µ/2) w^T w,

subject to e(y, u) = 0, where the matrix D corresponds to the discrete gradient associated with a forward finite difference scheme and E is the one associated with backward finite differences. It is worth remarking that, since we work with 1D spatial functions, the symmetrized discrete gradient coincides with the usual one. The Tikhonov term (µ/2) w^T w is further added to obtain well-posedness of the inverse problem.
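As an illustration of the structure of (14), the sketch below assembles the discretized objective for a given space-time state vector y. All dimensions, the selection matrix S and the variable names are illustrative stand-ins; at a consistent point (constant background, exact observations, w = 0) every term vanishes.

```python
import numpy as np

def tgv_da_objective(u, w, y, u_b, z, S, Hm, Binv, Rinv, D, E, alpha, beta, mu):
    """Objective of the discretized problem (14) for a given state y:
    background misfit + observation misfit + TGV terms + (mu/2) w^T w."""
    r_obs = S @ Hm @ y - z
    return (0.5 * (u - u_b) @ Binv @ (u - u_b)
            + 0.5 * r_obs @ Rinv @ r_obs
            + alpha * np.abs(D @ u - w).sum()
            + beta * np.abs(E @ w).sum()
            + 0.5 * mu * w @ w)

n = 8
u_b = np.full(n, 2.0)                 # constant background, so D u_b = 0
y = np.tile(u_b, 3)                   # toy space-time trajectory, m = n * Nt
Hm = np.eye(y.size)                   # state-to-observation-space map
S = np.eye(5, y.size)                 # observe the first 5 entries
z = S @ Hm @ y                        # exact observations
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
E = np.eye(n - 2, n - 1, k=1) - np.eye(n - 2, n - 1)
J0 = tgv_da_objective(u_b, np.zeros(n - 1), y, u_b, z, S, Hm,
                      np.eye(n), np.eye(5), D, E, 1.0, 1.0, 1e-2)
print(J0)  # every term vanishes at this consistent point
```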

3.3. Adjoint state. This subsection is devoted to discussing the main features of the adjoint equation, which is given by the following expression:

(15) [E + ZU + diag(Uy)]^T p = −H^T S^T R^{-1}(SHy − z).

Thanks to Lemma 1, which establishes the invertibility of the matrix e_y(y, u), the existence of a unique solution to the adjoint equation (15) follows (see, e.g., [8]).
In the next lemma we prove the boundedness of the adjoint state, making use of the discrete Gronwall inequality [20].

Lemma 3.3. Let Assumption 1 hold and let (y, p) be the state and adjoint state associated with some control u. Let us moreover assume that there exists a constant c > 0 such that condition (16) is satisfied. Then the following inequality holds:

‖p‖ ≤ ρ ‖H^T S^T R^{-1}(SHy − z)‖,

with ρ > 0 a constant independent of u and y.
Proof. Defining p = Pv, with P a permutation matrix reversing the time blocks and I the identity matrix, the adjoint system (15), written in the variable v, becomes a forward-in-time system. Performing a component-wise analysis in time, we distinguish the cases j = 1, j = 2, …, N_t − 1, and j = N_t. Multiplying by v^{N_t−j+1} it then follows, for all j = 1, …, N_t − 1, that an estimate holds in which the last inequality is obtained thanks to the bound (13). Since by hypothesis there exists a constant c > 0 such that (16) holds, applying the discrete Gronwall inequality the result follows with ρ = ∆t c exp(1/c − 1).

3.4. Optimality system. Next we formally derive a necessary optimality condition for problem (14) in the form of a Karush-Kuhn-Tucker system.
Theorem 3.4. Let ū be an optimal solution of problem (14) and let Assumption 1 hold. Then there exists an adjoint state p̄ ∈ R^m such that the optimality system (19) is satisfied.

Proof. Due to the nondifferentiability of the objective function, we use [8, Thm. 6.1] in order to derive the optimality system of the problem. We thus split the objective into a function j_1(u, w), which is differentiable, and a function j_2(u, w), which is continuous and convex. Using [8, Thm. 6.1], the optimality condition of the problem is given by a variational inequality. Introducing the adjoint state p̄ as the unique solution to (15) and replacing it in this inequality, and differentiating (8) in the direction h to obtain the linearized equation, the result follows by using the adjoint operator of e_y(ȳ, ū) together with the linearized equation.

4. Regularized data assimilation problem. As a preparatory step for the solution of the data assimilation problem (14), we next consider a properly regularized version, in which the non-differentiable part of the objective function is replaced by a differentiable one. Since we are going to use second-order optimization methods, we choose a C² Huber regularization H_γ, with parameter γ > 0, applied component-wise. The regularized version of the ℓ¹-norm is obtained by summing H_γ over the components; its first derivative is the vector h_γ(·) whose components are given in (25), and its second derivative is a diagonal matrix with the elements given in (26). With these, the regularized objective function J_γ and the corresponding TGV-regularized problem (28) are obtained.

4.1. Consistency of the regularization. In this section, we analyze the convergence of the regularized solutions towards a solution of the original data assimilation problem.
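The exact C² Huber function used in the paper is not reproduced in this excerpt. To illustrate the role of the parameter γ, the sketch below uses the classical (only C¹) Huber function with threshold 1/γ as a stand-in; it shares the two properties used below, namely H_γ(a) ≤ |a| and H_γ(a) → |a| as γ → ∞.

```python
import numpy as np

def huber(a, gamma):
    """Classical Huber smoothing of |a| with threshold delta = 1/gamma:
    quadratic near zero, linear (slope one) in the tails."""
    delta = 1.0 / gamma
    a = np.asarray(a, dtype=float)
    return np.where(np.abs(a) <= delta,
                    0.5 * gamma * a ** 2,
                    np.abs(a) - 0.5 * delta)

def huber_grad(a, gamma):
    """First derivative h_gamma: gamma * a near zero, sign(a) in the tails."""
    delta = 1.0 / gamma
    a = np.asarray(a, dtype=float)
    return np.where(np.abs(a) <= delta, gamma * a, np.sign(a))

a = np.linspace(-2.0, 2.0, 401)
for g in (1.0, 10.0, 1000.0):
    gap = np.max(np.abs(a) - huber(a, g))  # uniform gap is 1 / (2 * gamma)
    print(g, gap)
```

The uniform gap 1/(2γ) between |a| and its smoothing is exactly the quantity driving the consistency argument of this section: as γ grows, regularized objectives converge uniformly to the original one.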
Let us start by noticing that, for all a ∈ R, H_γ(a) ≤ |a|, with the gap vanishing as γ → ∞. In order to prove that the solutions of the regularized problems converge to those of the original one, we need to show the uniform boundedness of the solutions to the regularized data assimilation problems with respect to the Huber regularization parameter γ.
The next theorem shows the convergence of the solutions of the regularized problems to the solution of the original one.
Theorem 4.1. Let (y_γ, u_γ, w_γ) be a sequence of global solutions to problem (28), where y_γ satisfies e(y_γ, u_γ) = 0. Furthermore, let (y, u, w) be a global solution of problem (14). Then there exists a subsequence (y_{γ_k}, u_{γ_k}, w_{γ_k}) converging to a global solution of (14).

Proof. We start by recalling that B is a symmetric and positive definite covariance matrix. Hence,

(1/2)(u − u_b)^T B^{-1}(u − u_b) ≥ (1/2) λ_min(B^{-1}) ‖u − u_b‖²,

where λ_min(B^{-1}) is the minimum eigenvalue of the matrix B^{-1}. Using the optimality of (y_γ, u_γ, w_γ) and the fact that H_γ(a) ≤ |a| for all a ∈ R, it follows from the definition of the objective function that the sequence {u_γ} is bounded. By the Bolzano–Weierstrass theorem there exists a convergent subsequence {u_{γ_k}} whose limit is denoted by ū. Using similar arguments we obtain a uniform bound for {w_γ} and, using again the Bolzano–Weierstrass theorem, a convergent subsequence {w_{γ_k}} with limit w̄. Moreover, by the definition of the vector y_γ and Lemma 3.2, the sequence {y_γ} is also bounded, and there exists a convergent subsequence {y_{γ_k}} with limit ȳ. Thus, the triplet (ȳ, ū, w̄) is a candidate solution of problem (14).

Next we prove the feasibility of (ȳ, ū, w̄), that is, e(ȳ, ū) = 0. Recalling the state equation (8), we know that E y_{γ_k} + Z_{γ_k} U y_{γ_k} − g(u_{γ_k}) = 0, where Z_{γ_k} = Z(y_{γ_k}). In order to guarantee that ȳ satisfies the state equation in the limit k → ∞, we have to prove that lim_{k→∞} Z_{γ_k} U y_{γ_k} = Z̄ U ȳ, where Z̄ = Z(ȳ). This follows from the continuity of Z(y) with respect to its argument. Consequently, e(ȳ, ū) = 0. Since (y, u, w) is a global solution of problem (14), passing to the limit in the objective values we conclude that (ȳ, ū, w̄) is also a global solution.
In a similar manner as in the proof of Theorem 3.4, we obtain the existence of a Lagrange multiplier p_γ (adjoint state) such that the optimality system (29), characterizing the optimal solutions of (28), holds, where h_γ(·) is given in (25). Differently from system (19), the regularized optimality condition (29) does not involve variational inequalities. This facilitates the solution of the resulting nonlinear system, although care must still be taken in order to obtain a robust and fast convergent method.
Furthermore, due to the continuity of the solution mapping (u, w) ↦ y(u, w) (see Lemma 3.1), we can define the reduced functional f(u, w) := J_γ(y(u, w), u, w), with gradient (31) given by

∇f(u, w) = ( B^{-1}(u − u_b) − (1/∆t) p^1_γ + α D^T h_γ(Du − w),  −α h_γ(Du − w) + β E^T h_γ(Ew) + µ w ),

where p_γ = [p^1_γ, p^2_γ, …, p^{N_t}_γ]^T satisfies equation (29b).
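The reduced-gradient mechanism can be illustrated on a linear toy problem: for a state equation e(y, u) = Ay − u = 0 and f(u) = ½‖y(u) − z‖², a single adjoint solve A^T p = −(y − z) yields ∇f(u) = −p. The sketch below (all matrices are stand-ins for the paper's operators) checks the adjoint gradient against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 7
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # toy state operator
z = rng.standard_normal(n)                         # toy data

def reduced_f(u):
    y = np.linalg.solve(A, u)                      # state solve: e(y, u) = 0
    return 0.5 * np.sum((y - z) ** 2)

def reduced_grad(u):
    y = np.linalg.solve(A, u)
    p = np.linalg.solve(A.T, -(y - z))             # adjoint solve: A^T p = -(y - z)
    return -p                                      # nabla f = e_u^T p with e_u = -I

u = rng.standard_normal(n)
g_adj = reduced_grad(u)
eps = 1e-6
g_fd = np.array([(reduced_f(u + eps * e) - reduced_f(u - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.max(np.abs(g_adj - g_fd)))
```

This is the same pattern as in the paper: one state solve plus one adjoint solve deliver the full gradient, at a cost independent of the number of control variables.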

5. Reduced primal-dual Newton-type method. The numerical solution of the TGV data assimilation problem can in principle be carried out by solving the optimality system or, alternatively, by using iterative methods for the solution of the optimization problem. In this paper, we focus our attention on an iterative Newton-type method, with the additional use of auxiliary dual multipliers for improving the robustness of the approach. The proposed algorithm has the following ingredients:
1. use of Newton's method for the regularized optimization problems;
2. use of both primal and dual variables for the solution of a saddle point system in order to get search directions;
3. projection of the dual variables to get modified second-order matrices that guarantee descent directions;
4. use of a polynomial line-search rule for the step length.
We start by recalling Newton's method for our equality constrained optimization problem, which is given through the following steps:

Algorithm 1 Newton-type method
1: Initialize k = 0, u^0.
2: repeat
3: Compute y_k, the solution of the state equation given in (8).
4: Compute p_k, the solution of the adjoint equation given in (15).
5: Solve the Newton system arising from the first-order optimality condition and update the iterates.
6: k ← k + 1.
7: until Stopping criteria

Exploiting the structure of the first derivative of the Lagrangian, we include dual variables in order to transform the system into a saddle point problem. This technique is known to increase the stability of the system (see, e.g., [6]). Therefore, we add two variables q_1 and q_2 and obtain an equivalent extended Lagrangian L̃. The second derivative of this extended Lagrangian involves the second derivative h'_γ(·) of the Huber regularization given in (26). By analyzing the diagonal elements given by (26), it can be verified that many entries are zero, which causes ill-conditioning of the matrix, with the corresponding algorithmic robustness issues. Moreover, the positive definiteness of the matrix L̃ does not necessarily hold. For these reasons, we modify the matrix in such a way that better conditioning properties and positive definiteness are obtained.
To do so, we consider the primal-dual formulation of the problem and perform a projection of the dual variables (see [12]). Specifically, this consists in replacing a term appearing in the second derivative of the Huber regularization by the projected term

( q_1 / max{1, |q_1|} ) ⊙ (Du − w) / |Du − w|²,

where ⊙ denotes the component-wise product and the division, max and | · | are component-wise operations. Thus, h'_γ(Du − w) is replaced by a diagonal matrix Q_1, defined case-wise with ϑ_i = E_i w, and analogously for Q_2. It is worth remarking that in the last term of the third case of the Q_1 and Q_2 matrices we do not perform the projection. In this way, the modified second derivative of the Lagrangian used in the method is L̃_{(y,u,w,q_1,q_2)}(y, u, w, q_1, q_2, p). Furthermore, since the state equation does not depend on p, q_1 or q_2, its derivatives with respect to these variables are equal to zero. The system that defines the method is then the saddle point system (39).
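The component-wise projection of the dual variables onto the unit interval, q ↦ q / max{1, |q|}, can be sketched in a few lines (function name illustrative):

```python
import numpy as np

def project_dual(q):
    """Component-wise projection onto [-1, 1]: q_i / max(1, |q_i|).

    Capping the dual variables in this way is what allows the modified
    second-order matrix to remain positive semidefinite."""
    return q / np.maximum(1.0, np.abs(q))

q = np.array([-3.0, -0.5, 0.0, 0.25, 7.0])
print(project_dual(q))
```

Entries already inside [-1, 1] are left untouched, while larger ones are scaled back to the boundary, which is the bound |(q_1)_i| / max{|(q_1)_i|, 1} ≤ 1 used in the semidefiniteness proof below.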

5.1. Reduced method and matrix properties. Before presenting the convergence result of the proposed algorithm, we start by analyzing its reduced version. From Lemma 1 we know that the matrix Ξ in (37) is invertible and, therefore, the state and adjoint variables can be eliminated from system (39), which can then be rewritten as (40) or, equivalently, as (41), where ∇f(u, w) was defined in (31). Once the reduced matrices are defined, we present the reduced method in Algorithm 2.
Algorithm 2 Reduced Newton-type method
1: Initialize k = 0, u^0.
2: repeat
3: Compute y_k, the solution of the state equation given in (8).
4: Compute p_k, the solution of the adjoint equation given in (15).
5: Compute the descent direction by solving system (40).
6: Compute s_k using Algorithm 3 below.
7: Set u_{k+1} = u_k + s_k δ_u, w_{k+1} = w_k + s_k δ_w and x_k = (u_k, w_k).
8: k ← k + 1.
9: until Stopping criteria

We focus next on the main features of the matrix Π that guarantee that d_k = [δ_u, δ_w]^T is indeed a descent direction, and start by proving that Π is symmetric.

Proof. Since Π is a block matrix, we analyze each of its blocks, starting with Π_11. Since B is a covariance matrix, it is symmetric, and so is its inverse. It remains to prove that the matrices Q_1 and Ψ are also symmetric. The matrix Q_1 is, by definition, diagonal. Concerning the matrix Ψ, we notice that R is also a covariance matrix and, therefore, symmetric; since K is a diagonal matrix (see eq. (35)), Ψ is symmetric as well, and hence so is Π_11. We now focus on Π_22^T = µI + αQ_1^T + βE^T Q_2^T E. Since Q_1 and Q_2 are diagonal matrices, it immediately follows that Π_22 is symmetric.
By definition, the off-diagonal blocks satisfy Π_21 = Π_12^T, and the proof is complete. Now we are going to prove that Π is positive definite and, therefore, that the directions given by (40) are indeed descent directions for the data assimilation problem. Before stating the main result, we prove some required lemmata.
Lemma 5.1. If the norm of the term H^T S^T R^{-1}(SHy − z) is small enough, then there exists a constant κ > 0 such that, for all ξ ∈ R^m \ {0},

ξ^T L_{(y,y)}(y, u, w, p) ξ ≥ κ ‖ξ‖².
Proof. The first and second partial derivatives of the Lagrangian (32) with respect to y involve the matrix K given in (34). Using the definition of K and multiplying both sides by ξ ∈ R^m \ {0}, we obtain a lower bound involving the smallest eigenvalue c of the matrix H^T S^T R^{-1} SH, where the last inequality is obtained by applying Lemma 3.3. Choosing κ accordingly, the smallness assumption on H^T S^T R^{-1}(SHy − z) guarantees that κ > 0, and we get the desired result.

Next we prove that the projected matrices Q_1 and Q_2 are positive semidefinite.

Proof. Multiplying Π on both sides by a block vector (x, y), it suffices to verify that both Q_1 and Q_2 are positive semidefinite. The matrix Q_1 is diagonal, thus we just need to prove that each element of its diagonal is non-negative. We start by noticing that

|(q_1)_i| / max{|(q_1)_i|, 1} ≤ 1,

and, using definition (35), we analyze three cases. If D_i u − w_i ∈ A, the bound (42) yields non-negativity directly. In the second case, non-negativity follows immediately from the definition. In the third case, corresponding to the set I, the diagonal elements of Q_1 involve the factor θ_γ; by the definition of I it follows that θ_γ ≥ 0 and, using the same argument as in the first case, the corresponding entries are non-negative as well. Since each element of the diagonal of Q_1 is non-negative, we conclude that the matrix is indeed positive semidefinite. This also justifies why the last term in this expression is not projected: otherwise we could not guarantee its non-negativity.
In a similar manner it can be proved that Q_2 is also positive semidefinite. Altogether, combining Lemma 5.1 with the semidefiniteness of Q_1 and Q_2, we obtain that Π is positive definite.

6. Convergence analysis. This section is devoted to the study of the convergence properties of the proposed algorithm. Our aim is to use the general results developed for the convergence analysis of descent methods (see [8, Theorem 4.2]). In order to apply these results, we need to verify the angle condition (43) and the efficiency of the line search (44), where x = (u, w) and f stands for the reduced cost function, i.e., f(x) = J_γ(y(u), u). These conditions are required to hold on the level set, which is defined next.
Definition 6.1. Let ρ > 0 be a constant; the level set of the problem is denoted by N_ρ^0. Due to the continuous differentiability of y(u, w) (see Proposition 3.1) and the structure of the objective function, it is easy to prove that f(x) is radially unbounded and continuous; therefore, the level set N_ρ^0 is compact.

6.1. Angle condition. We start by proving that the angle condition (43) holds.

Proposition 4. Let Π_k be the reduced matrix defined in (41). Then there exist constants 0 < c < c̄, independent of k, such that c ‖q‖² ≤ q^T Π_k q ≤ c̄ ‖q‖², for all q ∈ R^n and all x_k ∈ N_ρ^0.

Proof. We start by recalling the structure of the matrix Π.
Concerning the upper bound, we first estimate ‖Ψ_k‖. Using the definition of K_k given in (34) and Lemma 3.3, together with Lemma 3.2 and the fact that u_k ∈ N_ρ^0, and defining ξ := 2ρ C_0 ‖U‖ ‖H^T S^T R^{-1}‖, we get that

‖Ψ_k‖ ≤ ‖H^T S^T R^{-1} SH‖ + ξ =: C_Ψ.

For the norm of Ξ^{-1} we consider an a priori estimate for the adjoint equation. Specifically, let v be the solution of

(47) Ξ v = E v + Z_k U v + diag(U y_k) v = g.

Proceeding as in the proof of Lemma 3.3, we get that ‖v‖ ≤ ρ ‖g‖, with ρ independent of y and u, and therefore ‖Ξ^{-1}‖ ≤ ρ.

Inserting these results in (46), and taking into account that ‖Q_i‖ ≤ γ for i = 1, 2, we obtain the upper bound with constant C := ρ² C_Ψ / (∆t)², from which the constant c̄ we were looking for follows. It is worth remarking that, from Proposition 4, it follows immediately that the descent directions {d_k} generated by Algorithm 2 satisfy the angle condition (see, e.g., [8]).
6.2. Uniform continuity of the gradient. In order to establish the uniform continuity of the reduced gradient, we first prove the following lemmata, which allow us to state the main theorem of this subsection in a compact way.
Lemma 6.2. Let u_1, u_2 ∈ N_ρ^0. Then there exists C > 0 such that ‖y(u_1) − y(u_2)‖ ≤ C ‖u_1 − u_2‖.

Proof. From Proposition 3.1 we know that y(u) = [y^1(u), y^2(u), …, y^{N_t}(u)] is continuously differentiable, which means that y^j(u) is continuously differentiable for all j = 1, …, N_t. Using the mean value theorem we can conclude that

‖y^j(u_1) − y^j(u_2)‖ ≤ ‖(y^j)'(ξ)‖ ‖u_1 − u_2‖

for some ξ belonging to a neighborhood which contains u_1 and u_2. Since (y^j)' is continuous and u_1, u_2 belong to the level set N_ρ^0, which is compact, there exists C > 0 that satisfies the required bound.

Lemma 6.3. Let u_1, u_2 ∈ N_ρ^0. Furthermore, let y(u_1) and y(u_2) be solutions of the state equations e(y(u_1), u_1) = 0 and e(y(u_2), u_2) = 0, respectively, and let p(u_1), p(u_2) be the associated adjoint states. Then the following bound holds:

‖p(u_1) − p(u_2)‖ ≤ C̃ ‖u_1 − u_2‖,

where C̃ > 0 is a constant independent of u_1, u_2 and of y or p.
Proof. From the component-wise-in-time analysis of the two adjoint equations (15) associated with u_1, u_2 ∈ N_ρ^0, and denoting A := (1/∆t) I + diag(y^{j−1}(u_1)) U^T + diag(U y^j(u_1)), we get an inequality whose last estimates follow from Lemma 3.2 and the fact that u_1 ∈ N_ρ^0, that is, there exists K > 0 such that ‖u_1‖ ≤ K. Denoting c = 1/∆t − ‖U‖(K + ∆t ‖g‖) and using Lemma 6.2 together with u_1, u_2 ∈ N_ρ^0, we arrive at an estimate with a constant η > 0 independent of u_1, u_2 and of y or p. Finally, applying the discrete Gronwall inequality, we get the result.

Next, we prove the uniform continuity of the gradient of the reduced cost functional.
Theorem 6.4. The gradient ∇f of the reduced cost functional is uniformly continuous on the level set N_ρ^0.

Proof. The proof follows directly from the Heine–Cantor theorem since, thanks to Lemmas 6.2 and 6.3, ∇f(u, w) is continuous on the level set N_ρ^0, which is compact.

6.3. Polynomial line search. Concerning the line-search procedure, we consider the polynomial backtracking algorithm proposed in [10]. In the following lines we give a short presentation of the algorithm.
It is known that usual backtracking is defined by a parameter λ ∈ (0, 1) such that, at each iteration, the new step length is computed as s_k = λ^k. Nevertheless, other methods, instead of taking a constant value of λ, let it vary throughout the iterations; this is the case of the polynomial line-search rule.
The main feature of polynomial backtracking is that the step length s_k is chosen using a quadratic or cubic approximation of the objective function. Indeed, if the Armijo condition is not satisfied at the first trial step, i.e.,

f̃(1) > f̃(0) + c_1 f̃'(0),

where f̃(s) = f(x_s), x_s = x + sd and 0 < c_1 ≪ 1 is a given constant, the algorithm uses the known function information and the quadratic model

m_q(s) = (f̃(1) − f̃(0) − f̃'(0)) s^2 + f̃'(0) s + f̃(0),

whose exact minimizer is given by

s = −f̃'(0) / (2 (f̃(1) − f̃(0) − f̃'(0))).

The algorithm then checks whether the Armijo condition is fulfilled. If it is, the algorithm stops with s_k = s. Otherwise, the new information is used to build a cubic model of the function, given by the expression

m_c(s) = a s^3 + b s^2 + f̃'(0) s + f̃(0),

whose minimizer is given by

s = (−b + √(b^2 − 3 a f̃'(0))) / (3 a).

The method can be summarized in the following steps: try the full step s = 1 first; if the Armijo condition fails, minimize the quadratic model; if it fails again, repeatedly minimize cubic models built from the two most recent trial steps. It is worth remarking that, despite the different ways of computing the step length, the Armijo condition still has to hold, that is,

f(x_k + s_k d_k) ≤ f(x_k) + c_1 s_k ∇f(x_k)^T d_k.

Using [13, Lemma 2.2] we can guarantee that there exist steps s_k satisfying the Armijo condition, no matter which backtracking rule is used.
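The quadratic/cubic strategy described above can be sketched as follows. This is a minimal illustration, not the implementation from [10]: the function names, the constant c1 = 1e-4 and the safeguarding interval [0.1 s, 0.5 s] are assumptions.

```python
import numpy as np

def polynomial_backtracking(f, grad_f, x, d, c1=1e-4, max_iter=30):
    """Quadratic/cubic backtracking line search (illustrative sketch)."""
    phi0 = f(x)                           # phi(s) = f(x + s*d)
    dphi0 = float(np.dot(grad_f(x), d))   # phi'(0); negative for a descent direction
    s, phi_s = 1.0, f(x + d)              # start with the full step s = 1
    s_prev, phi_prev = None, None
    for _ in range(max_iter):
        # Armijo condition: phi(s) <= phi(0) + c1*s*phi'(0)
        if phi_s <= phi0 + c1 * s * dphi0:
            return s
        if s_prev is None:
            # Minimizer of the quadratic model through phi(0), phi'(0), phi(s)
            s_new = -dphi0 * s**2 / (2.0 * (phi_s - phi0 - dphi0 * s))
        else:
            # Cubic model through phi(0), phi'(0) and the two most recent trials
            r1 = phi_s - phi0 - dphi0 * s
            r2 = phi_prev - phi0 - dphi0 * s_prev
            det = s**2 * s_prev**2 * (s - s_prev)
            a = (s_prev**2 * r1 - s**2 * r2) / det
            b = (-s_prev**3 * r1 + s**3 * r2) / det
            disc = max(b**2 - 3.0 * a * dphi0, 0.0)
            s_new = (-b + np.sqrt(disc)) / (3.0 * a) if a != 0 else -dphi0 / (2.0 * b)
        # Safeguard: keep the new trial step inside [0.1*s, 0.5*s]
        s_new = min(max(s_new, 0.1 * s), 0.5 * s)
        s_prev, phi_prev = s, phi_s
        s, phi_s = s_new, f(x + s_new * d)
    return s
```

For instance, for f(x) = x² at x = 1 with d = −∇f(x) = −2, the full step fails the Armijo test and the quadratic model returns the exact minimizing step s = 0.5.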
For the polynomial backtracking, the following bounds for the step lengths hold. If the Armijo condition is not fulfilled at the first trial, the resulting step length is bounded as in (51). If the Armijo condition is not fulfilled at subsequent trials, the bound (52) holds, where s_prev denotes the previous step length.
Since ∇f is uniformly continuous on the compact level set N_{ρ_0} (see Theorem 6.4), there exists η > 0, independent of x_k, such that the sufficient decrease estimate holds whenever ‖d‖ ≤ η. Jointly with (55), it is then clear that the Armijo condition holds whenever s_k ‖d_k‖ ≤ η. Using the bounds (51) and (52) obtained for the polynomial backtracking, either s_k = 1 and the Armijo condition is fulfilled at the first trial, or one of two scenarios occurs: the Armijo condition is fulfilled at the second trial, or it is fulfilled at a later trial. In both cases the step length is bounded from below, and therefore, for all k ∈ K, a uniform lower bound on s_k ‖d_k‖ follows. Using again the Armijo condition, we obtain a lower bound on the decrease, so that f(x_k) − f(x_k + s_k d_k) → 0 yields the result.
Condition (53) is automatically fulfilled for our algorithm, since the matrices {Π_k} satisfy the condition given in Proposition 4 and ∇f is uniformly continuous thanks to Theorem 6.4.

6.4. Convergence result. Summarizing, thanks to the satisfaction of the angle condition, the uniform continuity of the gradient and the efficiency of the line search, we arrive at the following global convergence result for the proposed algorithm. Although we do not provide a proof of superlinear convergence, in the tests below this convergence rate can be verified experimentally.

7. Numerical experiments. In this section, we report on numerical experiments whose primary goal is to show the behavior of TGV compared to other variational regularizations for the solution of the data assimilation problem for the inviscid Burgers' equation. Moreover, the performance of the proposed solution algorithm for the TGV data assimilation problem is verified experimentally.
Three experiments have been designed for these purposes, using the spatial domain Ω = (0, 10) and the time interval [0, T] = [0, 1]. The first one aims to verify the well-known staircase effect of total variation (TV) regularization and to show how total generalized variation (TGV) regularization behaves in such cases. The second experiment illustrates the performance of the proposed algorithm regarding its convergence and its behavior with respect to the regularization parameters. The last experiment studies the influence of the number of observations and of the background covariance matrix on the performance of the algorithm.
We perform the following steps for the construction of the dataset:
1. Set an exact solution (initial condition).
2. Solve the state equation with the exact solution and extract perfect observations. The observations are extracted every M instants in time and every N nodes in space; that is, we take the time instants 1, M+1, 2M+1, . . . and the nodes 1, N+1, 2N+1, . . ..
3. Add Gaussian noise with zero mean and covariance matrix B = σI to the exact solution to obtain the background information.
4. Fix the parameters α, β, γ and µ.
The parameters for the TGV regularization were chosen following the heuristic described in [9], where β/α ∈ (1/n)(0.75, 1.5). For the background matrix we set σ = 0.1, and we assume perfect observations, i.e., R = I. Finally, as stopping criterion we use a tolerance on the norm of the reduced gradient.

7.1. Comparison between TV and TGV regularization. This experiment is designed to show the staircase effect of TV regularization on the data assimilation problem and the way TGV regularization eliminates it. We solve problem (29) with 50 spatial discretization points and 150 points for the time discretization. Furthermore, we take 25 observations from 7500 possible choices, using M = 25 and N = 10.
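The dataset-construction steps above can be sketched as follows. The helper `solve_state` and all names are hypothetical placeholders for the paper's discrete Burgers solver; note that with B = σI the per-component standard deviation of the noise is √σ.

```python
import numpy as np

def build_dataset(u_exact, solve_state, M, N, sigma=0.1, seed=0):
    """Observations and background following steps 1-4 (illustrative).

    `solve_state(u)` is assumed to return the state trajectory as an
    array of shape (n_time, n_space).
    """
    rng = np.random.default_rng(seed)
    y = solve_state(u_exact)                 # step 2: perfect trajectory
    t_idx = np.arange(0, y.shape[0], M)      # time instants 1, M+1, 2M+1, ...
    x_idx = np.arange(0, y.shape[1], N)      # nodes 1, N+1, 2N+1, ...
    observations = y[np.ix_(t_idx, x_idx)]   # extracted perfect observations
    # step 3: background = exact solution + Gaussian noise with covariance sigma*I
    background = u_exact + rng.normal(0.0, np.sqrt(sigma), size=u_exact.shape)
    return observations, background
```

With 150 time steps, 50 nodes, M = 25 and N = 10, this sketch keeps every 25th time instant and every 10th node of the trajectory.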
The exact solution for this experiment is shown in Figure 1. As quality measure we use the structural similarity index (SSIM) [22], typically considered in image and signal processing. This index is given by

SSIM(x, y) = (2µ_x µ_y + C_1)(2σ_xy + C_2) / ((µ_x^2 + µ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)),

where µ_x, µ_y are the means and σ_x^2, σ_y^2 the variances of the vectors x and y, respectively. Furthermore, σ_xy is the covariance between x and y, and the constants are C_1 = (k_1 L)^2 and C_2 = (k_2 L)^2, with k_1 = 0.01, k_2 = 0.03 and L = 2. Note that SSIM = 1 if the two functions are identical. Therefore, we look for a set of parameters that makes the SSIM of the solution as close as possible to 1. Figure 2 shows the value of the SSIM index as the parameters of the TV and TGV regularizations are varied. We then choose the maximum value of SSIM for each problem; the solutions obtained for the best cases in terms of the SSIM index are presented in Figure 3. The best TV solution was obtained by setting β = 0.85; the algorithm converges after 21 iterations with γ = 1e5 and SSIM = 0.9495. On the other hand, the best TGV solution was obtained with α = 23.5 and β = 0.611; the algorithm converges after 15 iterations with γ = 1e4, µ = 1e−10 and SSIM = 0.9581. Furthermore, Figure 4 shows how the optimal state obtained with the TGV regularization fits the given observations at three specific snapshots.
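The SSIM formula above translates directly into a short function; this is a global (single-window) variant for 1D signals, assuming C_1 = (k_1 L)^2 and C_2 = (k_2 L)^2 as stated:

```python
import numpy as np

def ssim(x, y, k1=0.01, k2=0.03, L=2.0):
    """Global SSIM index between two 1D signals (no sliding window)."""
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = np.mean((x - mx) * (y - my))   # covariance between x and y
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / (
        (mx**2 + my**2 + C1) * (vx + vy + C2))
```

As expected from the formula, `ssim(x, x)` evaluates to 1 for any signal x, and the index drops below 1 as soon as the two signals differ.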

7.2. Performance of the algorithm. This experiment is devoted to the study of the performance of the algorithm concerning its rate of convergence, global convergence, behavior with respect to the parameters and generation of descent directions. The exact solutions and the corresponding results are shown in Figure 5. Furthermore, for each experiment we present the best solution (in terms of the SSIM) obtained. For Experiment 1 we used the parameters α = 5 and β = 0.075; after 15 iterations, the best solution, with an SSIM factor of 0.9592, is presented in the lower left part of Figure 5. The best solution shown in the lower middle part of Figure 5 is the same as in the previous subsection. Finally, the best solution for Experiment 3 was obtained using the parameters α = 2.5 and β = 0.075, and it has an SSIM factor of 0.9594.
For the global convergence of the algorithm, we use different exact solutions and provide the algorithm with several different starting points. Figure 6 shows that, in each case, the same value of the objective function is obtained when the algorithm stops. Moreover, this experiment shows numerically that in all cases the algorithm indeed generates descent directions. We use fixed values of the parameters α, β as follows: for Experiments 1 and 2 we use α = 10 and β = 0.2, and for Experiment 3, α = 5 and β = 0.1. Furthermore, we use the parameters γ = 1e4 and µ = 1e−10.
Another critical aspect of the performance of the algorithm is its rate of convergence. Since the proposed method is a second-order method, it is natural to expect a local superlinear rate, and this is what is analyzed in the following experiment. For this purpose, we solve the same problems as in the previous case, but setting the initial value u_0 ≡ 1 and varying the parameters α, β. Figure 7 shows the decrease of the residual, computed as ‖u_k − u^*‖, where u^* is the final value to which the algorithm converges. Furthermore, this figure suggests that the algorithm converges locally superlinearly and, since these results were obtained by varying the values of α and β, this behavior appears to be independent of the regularization parameters.
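A superlinear rate as in Figure 7 can be checked with a small diagnostic: the ratios ‖u_{k+1} − u^*‖ / ‖u_k − u^*‖ should tend to zero along the iterates. The helper and the synthetic iterates below are illustrative, not data from the paper.

```python
import numpy as np

def superlinear_ratios(iterates, u_star):
    """Consecutive residual ratios ||u_{k+1}-u*|| / ||u_k-u*||.

    For superlinear convergence the ratios tend to 0; for linear
    convergence they stagnate at a constant below 1.
    """
    r = np.array([np.linalg.norm(u - u_star) for u in iterates])
    r = r[r > 0]           # drop exact hits to avoid division by zero
    return r[1:] / r[:-1]
```

Applied to a residual history like 1e-1, 1e-3, 1e-7, the ratios decrease (1e-2, then 1e-4), which is the signature of a superlinear rate.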
We also study the behavior of the algorithm when the value of µ is varied. From the theoretical point of view, this parameter is important to guarantee the generation of descent directions; however, we are aware that adding this term is only an artifice. We solve two experiments using as exact solutions the ones depicted in Figure 8. Both experiments were solved with α = 2.5, β = 0.075 and γ = 1e4. The results are presented in Table 1, where NaN means that the algorithm could not find a descent direction due to the singularity of the matrix. Adding the term involving the squared norm of w yields a positive definite and, therefore, nonsingular matrix, so in every iteration the algorithm actually finds a descent direction. Furthermore, Table 1 also shows that we can use small values of the parameter µ without significantly changing the optimal value of the objective function. Finally, we also report the execution times for these experiments. The computations were performed on an Intel Core i7-6700 CPU at 3.40 GHz with 8 MB of cache.

The last experiment studies the influence of the number of observations and of the value of the background covariance matrix on the quality of the solutions, and how these factors affect the performance of the algorithm. For this experiment we use the exact solution from Experiment 2 and the parameters α = 23.5, β = 0.611 and µ = 1e−10. We vary the number of time instants and spatial points at which observations are taken and, in order to reduce the noise level, we vary the value of σ in the background covariance matrix B = σI. The results are summarized in Table 2, where a weak dependence of the number of iterations on the number of observations can be observed.