STABILIZED BFGS APPROXIMATE KALMAN FILTER



(Communicated by Jari Kaipio)
Abstract. The Kalman filter (KF) and Extended Kalman filter (EKF) are well-known tools for assimilating data and model predictions. The filters require storage and multiplication of n × n and n × m matrices and inversion of m × m matrices, where n is the dimension of the state space and m is the dimension of the observation space. Therefore, implementation of KF or EKF becomes impractical when the dimensions increase. Earlier works provide optimization-based approximative low-memory approaches that enable filtering in high dimensions. However, these versions ignore numerical issues that deteriorate the performance of the approximations: accumulating errors may cause the covariance approximations to lose non-negative definiteness, and approximative inversion of large close-to-singular covariances becomes tedious. Here we introduce a formulation that avoids these problems. We employ the L-BFGS formula to obtain low-memory representations of the large matrices that appear in the EKF, but inject a stabilizing correction to ensure that the resulting approximative representations remain non-negative definite. The correction applies to any symmetric covariance approximation, and can be seen as a generalization of the Joseph covariance update. We prove that the stabilizing correction enhances the convergence rate of the covariance approximations. Moreover, we generalize the idea by means of Newton-Schultz matrix inversion formulae, which allows one to employ them and their generalizations as stabilizing corrections.
1. Introduction. In recent years several approximations for KF and EKF have been proposed to reduce the computational complexity in large-scale cases, such as arise in robotics or data assimilation tasks. One of the simplest ways to reduce matrix storage requirements is to replace high-order covariances with their sampled estimates, which leads to the large family of ensemble Kalman filters (EnKF), first proposed in [18]. The advantageous property of EnKF formulations is their parallel nature as the ensemble members can be processed independently from each other.
The quality of EnKF estimates usually depends on the way the ensemble members are generated and on their ability to capture the essential part of the state dynamics.
An alternative approach is exploited by the Reduced Rank Kalman filter and the Reduced Order Extended Kalman filter that are based on projection of the full state space onto its subspace of a smaller dimension (see e.g. [13], [10], [34], [20]). The success of this approach is based on judicious choice of reduction strategy applied to the state space.
Another emerging solution for large-scale data assimilation, similar to approximate Kalman filtering, is the Weak-Constraint 4D-VAR analysis, which was first proposed in [35] and later suggested as a replacement for 4D-VAR, the dominant method for data assimilation in weather prediction [32], [22]. A very similar tool was also applied in oceanography, where data assimilation was used in a study of the German bight [5].
One possible EKF approximation can be obtained by replacing the explicit algebraic formulas used to calculate the posterior state vector and covariance matrix with a maximum a posteriori optimization problem. This allows one to formulate the Variational Kalman filter (VKF), which avoids undesired explicit storage of large matrices by employing Quasi-Newton optimization algorithms [3], [4]. The VKF performance depends on the magnitude of the smallest eigenvalue of the estimate covariance matrix. If the matrix degenerates to singularity, the VKF must be regularized by adjusting the model error covariance and the observation noise level. Another general problem shared by the variational formulations is the necessity of inverting the forecast estimate covariance matrix (i.e. the prediction covariance), which is sometimes close to singular, so that the inversion operation should be replaced by pseudoinversion, which may not be easy to implement at large scale.
A variant of approximation for the EKF can be derived by directly replacing the high-order matrices in the EKF formulas with their low-memory approximative versions. This replacement is done by introducing auxiliary optimization problems, which are then solved using Quasi-Newton or conjugate-gradient optimization methods. In contrast to the variational formulations mentioned above, the direct approximation does not involve approximate inversion of the full forecast estimate covariance. Instead, the inversion is applied to the forecast estimate covariance projected onto the observation space. This is advantageous when the number of observations is small compared to the dimension of the dynamical system. Hence, these direct approximations possess potentially lucrative properties not shared by their variational counterparts.
Several algorithms based on the direct approximation idea have been discussed in [2] and [4]. However, the previously proposed methods provide no guarantee that the approximated covariance matrix remains non-negative definite. This results in accumulation of approximation errors. In addition, the optimization methods that produce low-memory covariance approximations often rely on non-negative definiteness of the cost function's Hessian [25]. Therefore, when such methods are combined with direct EKF formulae approximations that do not retain non-negative definiteness property, the resulting low-memory representations may be inadequate.
The problem of poor numerical estimation of the analysis covariance matrix, which in practical implementations may become indefinite due to round-off errors, was first addressed by Joseph and prompted him to reformulate the covariance update (see [30]). However, Joseph's arrangement requires almost an order of magnitude more arithmetical operations than the original algorithm by Kalman.
An alternative approach, which does not impose extra arithmetical complexity, was suggested by Potter [12]. The key idea of this method is to factorize the analysis covariance matrix and to perform the updates on the square root, which eventually leads to more precision-tolerant numerics. An exhaustive overview of this idea together with its further extensions is given in [7]. However, both the Joseph update and the square root filters inspired by Potter have storage requirements similar to those of the classical sequential Kalman filter. Therefore, neither of them addresses the problems that arise when the state space dimension is large.
In this paper we propose a stabilized approximation for the EKF, which corrects the previously suggested approximative formulas with no negative impact on their limited-memory design. We prove that with this correction applied, the approximative estimate covariance matrix remains non-negative definite regardless of the approximation quality. Furthermore, we show that the stabilizing correction ensures a quadratic convergence rate, while the approximations previously suggested in [2] and [4] can only guarantee a linear convergence rate. We also propose a possible generalization of the suggested stabilizing correction achieved by using the Newton-Schultz recurrent matrix inversion formula. This approach can be further improved by using Schultz-type recurrent formulas of higher orders based on Chebyshev polynomials [27].
This paper contains three sections apart from the introduction. In the methods section we discuss the direct EKF approximation [2] based on L-BFGS unconstrained optimizer (see e.g. [25]). In this section we also introduce the family of stabilizing corrections and prove non-negative definiteness of the corresponding estimate covariance approximations.
The last two sections are devoted to examples. In this paper we consider the application of the proposed method to a filtering task formulated on top of an observation-deficient non-linear large-scale chaotic dynamical system. As an example of such a system we employ the two-layer Quasi-Geostrophic model (QG-model) [19], which is one of the common benchmarks employed to estimate the performance of data assimilation algorithms [21].
In the last section the approximative EKF with stabilizing correction is compared with the direct L-BFGS EKF approximation described in [2]. By running these filtering algorithms for the QG-model initialized at different resolutions it is demonstrated that the direct approximation fails when the assimilation model dimension increases. In all experiments the filters are used to deduce the full state of the QG-model based on deficient observation data, which constitutes a challenging ill-posed inverse problem. In addition, the dimension of the model is kept low enough to allow use of the exact EKF formulas, so that the EKF estimates can be employed as a baseline for the performance analysis of the Stabilized Approximate EKF (SA-EKF) proposed in this paper. The criteria used to measure the quality of the estimates are the root mean square error (RMS error) and the forecast skill.
Finally, the paper is wrapped up with a short conclusion based on the experiment results. In this section we summarize the properties and point out the benefits of the stabilizing correction. The paper has an appendix section, which comprises a self-contained derivation of matrix approximation formulas based on the BFGS update.

2. Methods. We begin our discussion of approximative Kalman filters by considering the following coupled system of stochastic equations:

x_{k+1} = M_k(x_k) + ε_k,    (1)
y_{k+1} = H_{k+1}(x_{k+1}) + η_{k+1}.    (2)

In the first equation x_k denotes the n × 1 state vector at time instance k, M_k is the transition operator from time instance k to time instance k + 1, and ε_k is a random n × 1 vector representing the modelling error of the transition operator M_k. In the second equation y_{k+1} denotes the m × 1 observed vector at time instance k + 1, H_{k+1} is the observation operator, and η_{k+1} is an m × 1 random vector representing the observation error.
From now on we assume that the error terms ε_k and η_{k+1} are mutually independent, with zero mean and covariance matrices equal to C_{ε_k} and C_{η_k} respectively. Following the general formulation of a filtering problem, the task is to derive an estimate of the system state x_{k+1} at time instant k + 1 and compute the corresponding estimate covariance matrix C^est_{k+1}, given the estimate x^est_k of x_k together with its analysis covariance C^est_k, the observed data y_{k+1}, and the covariance matrices C_{ε_k} and C_{η_k} of the corresponding error terms. Put another way, the problem under consideration is to invert deficient observation data based on a previously made analytical prediction and related statistical parameters. Since the problem can be significantly ill-posed, the quality of the resulting estimates is expected to depend on the particular properties of the observation operator and on the noise levels.
2.1. Extended Kalman filter. The Kalman filter is the standard approach to the formulated inverse problem. This is an optimal decision-making tool which couples the information from model prediction and observed data and produces the estimate for the next state. In the case in which M_k and H_k are linear, the Kalman filter estimate of x_{k+1} is proved to be optimal in the sense that it yields the minimum-variance unbiased linear estimator [23]. Below we formulate the basic Kalman filter algorithm (algorithm I) with linear transition operator M_k and linear observation operator H_k (the notation here is altered due to the assumed linearity).
(i) Prediction step
a) Compute the state prediction x^p_{k+1} = M_k x^est_k;
b) Compute the prediction covariance C^p_{k+1} = M_k C^est_k M^T_k + C_{ε_k}.
(ii) Correction step
a) Compute the Kalman gain G_{k+1} = C^p_{k+1} H^T_{k+1} (H_{k+1} C^p_{k+1} H^T_{k+1} + C_{η_{k+1}})^{-1};
b) Compute the next state estimate x^est_{k+1} = x^p_{k+1} + G_{k+1} (y_{k+1} - H_{k+1} x^p_{k+1});
c) Compute the covariance of the next state estimate C^est_{k+1} = (I - G_{k+1} H_{k+1}) C^p_{k+1}.

The extension of the algorithm to the nonlinear case is provided by the extended Kalman filter (EKF). It is obtained by using the nonlinear model in the prediction step (i.a) and in the simulated observations (step ii.b), i.e. in these steps the matrices M_k and H_{k+1} should be replaced with the nonlinear operators M_k(·) and H_{k+1}(·) respectively. The rest of the algorithm employs the Jacobians of the nonlinear operators in place of M_k and H_k. For large models the derivatives have to be implemented as special code-level subroutines to avoid explicit matrix storage. The routines used to compute the Jacobian matrices M_k and H_k are called tangent-linear codes. Subroutines implementing the corresponding adjoint operators M^T_k and H^T_k are called adjoint codes. To separate them from the notation used in algorithm I, hereinafter we will use the superscript (·)^TL to denote a tangent-linear code and the superscript (·)^AD to emphasize an adjoint code.
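To make the update concrete, here is a minimal dense-matrix sketch of one predict/correct cycle of algorithm I (the function name, dimensions, and parameter values are illustrative and ours; a large-scale implementation would never form these matrices explicitly):

```python
import numpy as np

def kf_step(x_est, C_est, M, H, C_eps, C_eta, y):
    """One predict/correct cycle of the linear Kalman filter (algorithm I)."""
    # (i) Prediction step
    x_p = M @ x_est                               # a) propagate the state estimate
    C_p = M @ C_est @ M.T + C_eps                 # b) propagate its covariance
    # (ii) Correction step
    A = H @ C_p @ H.T + C_eta                     # the m x m matrix to be inverted
    G = C_p @ H.T @ np.linalg.inv(A)              # a) Kalman gain
    x_new = x_p + G @ (y - H @ x_p)               # b) next state estimate
    C_new = (np.eye(len(x_p)) - G @ H) @ C_p      # c) its covariance
    return x_new, C_new
```

The O(m^3) cost of the inversion and the O(n^2) storage of the covariances in this sketch are exactly the bottlenecks the approximations below are designed to remove.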
2.2. L-BFGS low-memory matrix approximation. The EKF implementation complexity is estimated as O(n^3 + m^3) multiplications. Namely, the EKF requires storing and multiplying n × n, m × m and n × m matrices, and inverting an m × m matrix. This may become prohibitively expensive when the state space dimension n or the observation space dimension m increases. In the case of large n or m the EKF can be approximated by replacing the impracticable high-order matrices with their approximative low-memory representations. A number of different low-memory matrix approximations have been suggested (see e.g. [2], [3], [4]).
In this paper the desired approximation is attained by using the BFGS update [25]. The BFGS update is a rank-two update formula employed by a number of Quasi-Newton unconstrained optimizers. The present study employs the Limited-Memory BFGS iterative optimization method (L-BFGS), which uses the BFGS update to produce a low-memory approximation for the Hessian of the cost function. More precisely, the following quadratic minimization problem is considered:

min_x (1/2) x^T A x - b^T x,    (4)

where A is an n-by-n positive-definite matrix and b is an n-by-1 vector. When applied to the problem (4), the L-BFGS iterations generate a series of vector pairs (s_1, y_1), ..., (s_k, y_k) such that A s_i = y_i, i = 1, ..., k. As discussed in [9], this iteration history can then be used to construct a low-memory approximation for the matrix A or A^{-1}. This idea allows us to reformulate the EKF by explicitly replacing the prohibitively high-order matrices with their L-BFGS low-memory approximations. In this paper we denote such approximations of the problematic matrices by adding the symbol "˘" on top of the regular matrix notation introduced in algorithm I. For instance, an approximation of the estimate covariance matrix C^est_k introduced in algorithm I is denoted by C̆^est_k, where the nature of the approximation will be clarified by the context.
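The property A s_i = y_i is easy to see for the quadratic cost (4), whose gradient is Ax - b: any pair of successive iterates yields an exact secant pair. A small sketch (plain gradient steps stand in for the actual L-BFGS iterates; all names, dimensions, and the step size are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric positive-definite matrix of problem (4)
b = rng.standard_normal(n)

def grad(x):
    return A @ x - b                 # gradient of 0.5 * x^T A x - b^T x

x = np.zeros(n)
pairs = []
for _ in range(5):                   # plain gradient steps stand in for L-BFGS iterates
    g = grad(x)
    x_next = x - 0.05 * g
    # secant pair: y_i = difference of gradients = A (x_next - x) = A s_i, exactly
    pairs.append((x_next - x, grad(x_next) - g))
    x = x_next
```

The pairs, rather than A itself, are all that needs to be stored.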
Hereby, the algorithm that we call the L-BFGS Approximate EKF (algorithm II) can be formulated as follows:

(i) Prediction step
a) Compute the state prediction x^p_{k+1} = M_k(x^est_k);
b) On the code level define the prior covariance operator C̆^p_{k+1} = M^TL_k C̆^est_k M^AD_k + C_{ε_k}.
(ii) Correction step
a) Define an optimization problem of type (4) assuming A = H^TL_{k+1} C̆^p_{k+1} H^AD_{k+1} + C_{η_{k+1}} and b = y_{k+1} - H_{k+1}(x^p_{k+1}). Set x_* to be the minimizer of this problem and B_* to be the L-BFGS approximation of the matrix A^{-1};
b) Compute the next state estimate x^est_{k+1} = x^p_{k+1} + C̆^p_{k+1} H^AD_{k+1} x_*;
c) Define C̆^est_{k+1} = C̆^p_{k+1} - C̆^p_{k+1} H^AD_{k+1} B_* H^TL_{k+1} C̆^p_{k+1}.

The outlined algorithm defines one iteration of the L-BFGS approximate EKF. The step (ii.c) can be modified by assuming that the vector b is a normally distributed random variable with zero mean and identity covariance [4]. The step (i.b) on the code level must be implemented as a subroutine. This means that the high-order matrices from algorithm I either get implemented on the code level as memory-efficient subroutines (assuming that such an implementation is available) or are represented by means of the iteration history generated by L-BFGS. Since the number of L-BFGS iterations is usually much smaller than the order of the target Hessian matrix [25], the storage requirements are notably reduced with respect to those of the previously discussed algorithm I.

Inverse Problems and Imaging, Volume 9, No. 4 (2015), 1003-1024
It should also be mentioned that in algorithm II high-order matrix multiplications are never actually carried out explicitly: they are either performed implicitly in an efficient way by the subroutines containing the corresponding TL- and AD-codes, or performed based on the L-BFGS iteration history by the procedure called two-loop recursion (see [9]).
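The two-loop recursion of [9] applies the inverse-Hessian approximation encoded by the pairs (s_i, y_i) to a vector without forming any matrix. A sketch, checked against the explicitly assembled recursive BFGS inverse update (function names and dimensions are ours):

```python
import numpy as np

def two_loop(v, pairs, gamma=1.0):
    """Compute B v, where B is the L-BFGS inverse-Hessian approximation
    built from the history pairs [(s_1, y_1), ..., (s_k, y_k)]."""
    rhos = [1.0 / (y @ s) for s, y in pairs]
    q = v.astype(float).copy()
    alphas = []
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):   # newest to oldest
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                                              # initial matrix B0 = gamma * I
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):  # oldest to newest
        beta = rho * (y @ r)
        r = r + (a - beta) * s
    return r

def bfgs_inverse(pairs, n, gamma=1.0):
    """Reference: assemble the same approximation as a dense matrix."""
    B = gamma * np.eye(n)
    for s, y in pairs:
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(y, s)
        B = V.T @ B @ V + rho * np.outer(s, s)
    return B
```

Only the stored pairs are touched, so the cost per product is O(kn) for k pairs.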
This wraps up the overview of the previously proposed methods that serve as the basis for the algorithm proposed in this paper. In the next section, which contains the main part of the study, we outline the problematic parts of algorithm II and provide a remedy for them.

2.3. Stabilized approximate extended Kalman filter. The main pitfall of algorithm II described in the last section is that the update in step (ii.c) relies on the approximation B_* of the matrix A^{-1} and thus does not guarantee non-negative definiteness of the result. This drawback violates the assumptions behind the BFGS update formula, which may lead to an invalid low-memory BFGS approximation for the estimate covariance C^est_{k+1}. Furthermore, this error migrates to the next iteration of the filter and potentially affects the prior covariance in step (i.b). Hence, the L-BFGS approximation in step (ii.a) becomes incorrect and algorithm II can be vulnerable to accumulated errors.
In this paper we present a correction for step (ii.c) which eliminates the discussed vulnerabilities. This correction is based on the following simple lemma.
Lemma 2.1. Suppose that C̆^p_{k+1} is computed in accordance with step (i.b) of algorithm II and C̆^est_k is assumed to be non-negative definite. Then for an arbitrary symmetric matrix B_* the matrix

C̆^est_{k+1} = C̆^p_{k+1} - C̆^p_{k+1} H^AD_{k+1} (2I - B_* A) B_* H^TL_{k+1} C̆^p_{k+1}    (5)

is non-negative definite, and C̆^est_{k+1} is equal to C^est_{k+1} when B_* = A^{-1} (see algorithm I, step (ii.c)). Here I denotes the identity matrix of the respective dimension.
Proof. For conciseness we denote C = C̆^p_{k+1}, H = H^TL_{k+1}, R = C_{η_{k+1}}, G = C H^T B_*. The non-negative definiteness of C̆^est_{k+1} is proved by the following sequence of matrix equalities:

C̆^est_{k+1} = C - C H^T (2I - B_* A) B_* H C
            = C - 2 C H^T B_* H C + C H^T B_* (H C H^T + R) B_* H C
            = (I - G H) C (I - G H)^T + G R G^T,

where both terms in the last expression are non-negative definite. When B_* = A^{-1} the above sequence of matrix equalities reduces to the derivation of the Joseph stabilized covariance update [8], [12], extensively employed in the past to circumvent the problem of accumulated errors in the Kalman filter. The main advantage over the original Joseph approach is that B_* is allowed to be any symmetric approximation of A^{-1}, which enables the use of a wide variety of methods approximating A^{-1}. For instance, the stabilizing correction allows the usage of the SR-1 approximation [26] in place of B_*, while the L-BFGS EKF suggested in [2] assumed that B_* was at least non-negative definite (even in this case giving no guarantee that the corresponding estimate covariance C̆^est_{k+1} was a proper covariance matrix with no negative eigenvalues).
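The identity underlying the lemma, namely that the update (5) coincides with a Joseph-type form (I - GH) C (I - GH)^T + G R G^T with gain G = C H^T B_*, can be verified numerically even for an indefinite symmetric B_* (all dimensions and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3
L = rng.standard_normal((n, n))
C = L @ L.T + np.eye(n)                  # prior covariance, non-negative definite
H = rng.standard_normal((m, n))          # observation operator
R = 0.5 * np.eye(m)                      # observation noise covariance
A = H @ C @ H.T + R
S = rng.standard_normal((m, m))
B = S + S.T                              # arbitrary symmetric B*, possibly indefinite
G = C @ H.T @ B                          # Joseph-type gain
joseph = (np.eye(n) - G @ H) @ C @ (np.eye(n) - G @ H).T + G @ R @ G.T
stabilized = C - C @ H.T @ (2 * np.eye(m) - B @ A) @ B @ H @ C
```

Both expressions agree to rounding error, and the result has no negative eigenvalues even though B itself is not a valid inverse-covariance approximation.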
In accordance with the claim of lemma 2.1 we can correct the step (ii. c) of algorithm II. We call the updated algorithm Stabilized Approximate EKF (SA-EKF).
Our next goal is to show that the suggested stabilizing correction (5) actually converges to C^est_{k+1} when B_* approaches A^{-1}. Furthermore, we will show that the rate of convergence for the stabilized approximation (5) is quadratic, whereas the standard approximation employed by algorithm II guarantees only a linear convergence rate.
Lemma 2.2. Suppose that the assumptions made in lemma 2.1 hold. Then the stabilizing approximation (5) converges to C^est_{k+1} as B_* approaches A^{-1}. Furthermore, the following inequalities hold:

||C̆^est_{k+1} - C^est_{k+1}||_Fr ≤ ||A|| ||H^TL_{k+1} C̆^p_{k+1}||_Fr^2 ||A^{-1} - B_*||^2,    (6)
||C̆^est_{k+1} - C^est_{k+1}|| ≤ ||A|| ||H^TL_{k+1} C̆^p_{k+1}||^2 ||A^{-1} - B_*||^2.    (7)

Here ||·||_Fr denotes the Frobenius norm and ||·|| denotes the matrix 2-norm.
Proof. We start by considering an auxiliary statement: for any matrices W and P of consistent dimensions, ||W^T P W||_Fr ≤ ||P|| ||W||_Fr^2. Indeed, it is easy to verify that

||W^T P W||_Fr^2 = Σ_{i,j} (W_i, P W_j)^2,

where W_i denotes the i-th column of the matrix W and (·, ·) is convenience notation for the inner product. Hence, recalling the Cauchy-Schwarz inequality we see that

Σ_{i,j} (W_i, P W_j)^2 ≤ Σ_{i,j} ||W_i||^2 ||P W_j||^2 ≤ ||P||^2 ||W||_Fr^4,

and thus taking the square root of both sides of this inequality we arrive at the desired statement.
Our next goal is to estimate the discrepancy between C̆^est_{k+1} and C^est_{k+1}. Denoting Δ = A^{-1} - B_* and using the notation of lemma 2.1, we obtain

C̆^est_{k+1} - C^est_{k+1} = C H^T (A^{-1} - (2I - B_* A) B_*) H C = (Δ H C)^T A (Δ H C).    (8)

Recalling the auxiliary inequality ||W^T P W||_Fr ≤ ||P|| ||W||_Fr^2 and assigning P = A, W = Δ H C, we arrive at the following estimate:

||(Δ H C)^T A (Δ H C)||_Fr ≤ ||A|| ||Δ H C||_Fr^2 ≤ ||A|| ||Δ||^2 ||H C||_Fr^2.    (9)

Putting together (8) and (9) we arrive at the desired inequality (6). Inequality (7) is proved analogously following the properties of the matrix norm.
Lemma 2.3. The approximation C̆^est_{k+1} in step (ii.c) of algorithm II converges to C^est_{k+1} when B_* approaches A^{-1}. Furthermore, the following inequalities hold:

||C̆^est_{k+1} - C^est_{k+1}||_Fr ≤ ||H^TL_{k+1} C̆^p_{k+1}||_Fr^2 ||A^{-1} - B_*||,    (10)
||C̆^est_{k+1} - C^est_{k+1}|| ≤ ||H^TL_{k+1} C̆^p_{k+1}||^2 ||A^{-1} - B_*||.    (11)

Proof. The proof is similar to the proof of lemma 2.2.
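The different rates are easy to observe numerically: perturbing B_* = A^{-1} + εE and shrinking ε by a factor of 10 shrinks the stabilized error by a factor of 100 and the plain algorithm II error only by a factor of 10 (the setup below, including all names and dimensions, is an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
L = rng.standard_normal((n, n))
C = L @ L.T + np.eye(n)                  # prior covariance
H = rng.standard_normal((m, n))
A = H @ C @ H.T + 0.5 * np.eye(m)
Ainv = np.linalg.inv(A)
E = rng.standard_normal((m, m))
E = E + E.T                              # symmetric perturbation direction
C_exact = C - C @ H.T @ Ainv @ H @ C     # exact analysis covariance

def errors(eps):
    B = Ainv + eps * E                   # imperfect approximation of A^{-1}
    plain = C - C @ H.T @ B @ H @ C                       # algorithm II update
    stab = C - C @ H.T @ (2 * B - B @ A @ B) @ H @ C      # stabilized update (5)
    return (np.linalg.norm(plain - C_exact, 'fro'),
            np.linalg.norm(stab - C_exact, 'fro'))

p1, s1 = errors(1e-3)
p2, s2 = errors(1e-4)
```

Here the scaling is exact because the plain error equals ε C H^T E H C while the stabilized error equals ε^2 C H^T E A E H C.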
The claims of lemmas 2.1 and 2.2 can be generalized. Moreover, if some additional assumptions are made, it is possible to show that the stabilizing correction (5) improves the initial approximation B_* of A^{-1}.
Earlier in this section we have shown that the approximate analysis covariance C̆^est_{k+1} from algorithm II has better numerical properties if the approximation B_* of the matrix A^{-1} from this algorithm is replaced by (2I - B_* A) B_*. Hence, the question arises whether this replacement term, as well as the nice numerical features it brings to the covariance approximation, can be described more generally. We begin by considering the following sequence of matrix-valued functions F_n(X) defined by the recurrent relation:

F_n(X) = X + (I - X A) F_{n-1}(X), n ≥ 2,    (12)

where A is defined in the same way as in lemma 2.1, X is an arbitrary symmetric matrix of the same dimension as A, and the recursion is initialized by F_1(X) = X.
It is easy to verify by induction that A^{-1} is a fixed point of the function F_n(X), i.e. F_n(A^{-1}) = A^{-1} for n ≥ 1, and that when n = 2 and X = B_*, the expression for F_n yields exactly the stabilizing replacement (2I - B_* A) B_* = 2B_* - B_* A B_* employed in (5). The next lemma shows that the matrix-valued function sequence F_n(X) provides an easy way to improve the approximation for A^{-1} while preserving the advantageous properties of the estimate given in (5).
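Assuming the first-order Schultz form of the recursion, F_n(X) = X + (I - XA) F_{n-1}(X), which reproduces F_2(X) = 2X - XAX, the fixed-point property and the error formula A^{-1} - F_n(X) = ((A^{-1} - X) A)^{n-1} (A^{-1} - X) can be checked numerically; one can also check the doubling identity F_2(F_n(X)) = F_{2n}(X), which links the sequence to the classical Newton-Schultz iteration (all names below are ours):

```python
import numpy as np

def F(n, X, A):
    """Schultz-type sequence: F_1 = X, F_n = X + (I - X A) F_{n-1}."""
    I = np.eye(len(A))
    out = X
    for _ in range(n - 1):
        out = X + (I - X @ A) @ out
    return out

rng = np.random.default_rng(4)
m = 5
Q = rng.standard_normal((m, m))
A = Q @ Q.T + m * np.eye(m)          # symmetric positive definite, as in lemma 2.1
Ainv = np.linalg.inv(A)
S = rng.standard_normal((m, m))
X = 0.01 * (S + S.T)                 # arbitrary symmetric initializer
D = Ainv - X                         # error of the initial guess
```

The doubling identity means that iterating the n = 2 stabilizer k times reproduces F evaluated at index 2^k.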
Lemma 2.4. Suppose that the assumptions made in lemma 2.1 are met. Let F_n(X) be the matrix-valued function sequence generated by the recurrent relation (12). Then for any symmetric X the following matrix is non-negative definite:

C̆^est_{k+1}(n) = C̆^p_{k+1} - C̆^p_{k+1} H^AD_{k+1} F_n(X) H^TL_{k+1} C̆^p_{k+1}.    (13)

Moreover, the following series of inequalities holds:

||C̆^est_{k+1}(n) - C^est_{k+1}||_Fr ≤ ||A||^{n-1} ||H^TL_{k+1} C̆^p_{k+1}||_Fr^2 ||A^{-1} - X||^n,    (14)
||C̆^est_{k+1}(n) - C^est_{k+1}|| ≤ ||A||^{n-1} ||H^TL_{k+1} C̆^p_{k+1}||^2 ||A^{-1} - X||^n,    (15)
||A^{-1} - F_n(X)|| ≤ ||A^{-1}|| ||A X - I||^n.    (16)

Proof. Before delving into the details of the proof we note that C̆^est_{k+1}(n) defined in (13) includes the approximation from algorithm II and the stabilized correction (5) as special cases when n = 1 and n = 2 respectively.
We start by proving that the matrix C̆^est_{k+1}(n) is non-negative definite. First, we note that as far as X is symmetric, the matrix F_n(X) is also symmetric for all n ≥ 1. Then, accounting for the claim of lemma 2.1, we conclude that C̆^est_{k+1}(n) is non-negative definite when n ≥ 2.
To prove (14) we first derive the following auxiliary equality:

A^{-1} - F_n(X) = (Δ A)^{n-1} Δ,    (17)

where Δ = A^{-1} - X and n ≥ 1. We proceed by induction: if n = 1, then F_n = X and formula (17) holds. Next, we suppose that (17) is true for n = n_0 and prove that it remains true when n = n_0 + 1:

A^{-1} - F_{n_0+1}(X) = A^{-1} - X - (I - X A) F_{n_0}(X) = (I - X A)(A^{-1} - F_{n_0}(X)) = Δ A (Δ A)^{n_0-1} Δ = (Δ A)^{n_0} Δ.

Thereby, we have proved that (17) holds for any n ≥ 1. We can now finalize the proof of (14):

||C̆^est_{k+1}(n) - C^est_{k+1}||_Fr = ||(H C)^T ((Δ A)^{n-1} Δ) (H C)||_Fr ≤ ||(Δ A)^{n-1} Δ|| ||H C||_Fr^2 ≤ ||A||^{n-1} ||Δ||^n ||H C||_Fr^2,

where the notation of lemma 2.1 is used; (15) and (16) follow analogously. Let us assume that ||A X - I|| < 1. This implies F_n(X) → A^{-1} as n → ∞.

In essence, the sequence (12) corresponds to a matrix inversion formula of Newton-Schultz type. As follows from lemma 2.4, F_n converges to A^{-1} when the initial approximation B_* complies with the requirement ||A B_* - I|| < 1. This convergence result is not new; however, the inequalities provided by lemma 2.4 are of interest as they make use of (12) in the framework of the Kalman filter, which has not been analysed before. Under certain requirements assumed for B_*, this result holds even for non-symmetric matrices, and moreover, it can be proved that when the assumption of the matrix A being non-singular is relaxed, the sequence F_n converges to the Moore-Penrose pseudoinverse A^+, as shown in [6]. In addition, the properties of Chebyshev polynomials allow one to generalize the quadratic sequence F_n by a cubic recurrence formula, which provides an accelerated convergence rate (see [27] for details). For the case of symmetric A it was proved in [27] that the initializer B_* = (1/||A||_Fr) I guarantees that ||A B_* - I|| ≤ 1 - 1/(m^{1/2} κ), where m is the dimension of the matrix and κ is its condition number. Therefore, such a choice may not provide a satisfactory speed of convergence when m is large and/or the matrix A is ill-conditioned. Hence, the generalized stabilizing correction (12) will not improve filtering properties unless provided with a good initial guess B_*. Deriving such an initial guess, or using Newton-Schultz series in approximative Kalman filter formulations other than those discussed in this paper, leaves room for further research.
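The quality of the Frobenius-norm initializer can be checked directly: for a symmetric positive definite A the bound ||A B_* - I|| ≤ 1 - 1/(m^{1/2} κ) holds, but both sides approach 1 for ill-conditioned A (a sketch; the spectrum and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 30
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
eigs = np.linspace(1.0, 1e3, m)                 # widely spread spectrum
A = (Q * eigs) @ Q.T                            # SPD with the prescribed eigenvalues
B_star = np.eye(m) / np.linalg.norm(A, 'fro')   # initializer from [27]
lhs = np.linalg.norm(A @ B_star - np.eye(m), 2)
kappa = eigs.max() / eigs.min()                 # condition number of A
bound = 1.0 - 1.0 / (np.sqrt(m) * kappa)
```

Although the contraction condition is satisfied, a factor this close to 1 implies very slow Newton-Schultz convergence, which is consistent with the remarks above.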
However, it should be mentioned that in our numerical experiments initializers computed by the BFGS update formula failed to satisfy the condition ||A B_* - I|| < 1 when the dimension of the observation space increased over 100.

3. Two-layer Quasi-Geostrophic model. This section describes the two-layer Quasi-Geostrophic model [28]. The model is handy for data assimilation testing purposes as it is chaotic and its dimension can be adjusted by increasing the density of the spatial discretization. At the same time, the QG-model is relatively cheap to run and requires no special hardware to perform the simulations. However, the model is sufficiently realistic and leaves room for the formulation of challenging inverse problems, so it can be used for testing purposes in the performance analysis of data assimilation algorithms [21], [33].
In this section we discuss the governing equations of the QG-model and very briefly highlight the main stages of the numerical solver that we use to integrate the model. In our study we implemented and used the tangent-linear and adjoint codes of this solver. However, we skip the details here, as the derivation of the TL- and AD-codes is rather straightforward but contains an exhaustive amount of technical detail.
The two-layer Quasi-Geostrophic model considered in this paper simulates the quasi-geostrophic (slow) wind motion. The model is defined on a cylindrical surface vertically divided into two interacting layers, where the bottom layer is influenced by the earth topography. The model geometry implies periodic boundary conditions along the latitude. The boundary conditions on the northern and southern walls (the cap and the bottom of the cylinder) are user-defined constant values. Aside from the boundary conditions at the "poles" and the earth topography, the model parameters include the depths of the top and the bottom undisturbed atmospheric layers, which we hereafter denote by D_1 and D_2 respectively. In addition, the QG-model accounts for the earth rotation via the value of the Coriolis parameter f_0. A possible geometrical topology of the QG-model is presented in figure I. In the figure the terms U_1 and U_2 denote the mean zonal flows in the top and the bottom layer respectively.
The QG-model acts on the components of potential vorticity and stream function, related to each other in accordance with the equations:

q_1 = ∇^2 ψ_1 - F_1 (ψ_1 - ψ_2) + β y,    (18)
q_2 = ∇^2 ψ_2 - F_2 (ψ_2 - ψ_1) + β y + R_s.    (19)

Here q_i and ψ_i denote the potential vorticity and stream function at the top layer for i = 1 and at the bottom layer for i = 2. The stream function ψ_i operated by the model is analogous to pressure. More precisely, ∇ψ_i = (v_i, -u_i), where u_i and v_i are the zonal and meridional wind speed components defined at the corresponding layer. In addition, it is assumed that the potential vorticity respects a substantial conservation law, which reads as follows:

D_i q_i / Dt = 0, i = 1, 2.    (20)
The terms D_i·/Dt denote the substantial derivatives defined for the speed field (u_i, v_i), i.e.

D_i·/Dt = ∂·/∂t + u_i ∂·/∂x + v_i ∂·/∂y.

Equations (18), (19) and (20) fully qualify the two-layer QG-model. However, these PDEs are nondimensional. The nondimensionalization procedure for the QG-model is defined by the following relations:

F_i = f_0^2 L^2 / (g' D_i), i = 1, 2,    (21)
g' = g Δθ / θ̄,    (22)
β = β_0 L^2 / U,    (23)
R_s = S(x, y) / (η D_2),    (24)

where L and U are the length and speed main scales, β_0 is the northward gradient of the Coriolis parameter, g is the gravity acceleration, Δθ is the temperature change across the layer interface, θ̄ is the mean temperature, S(x, y) is a surface specifying the earth topography, and η = U/(f_0 L) is the Rossby number of the given system.
In the rest of this section we sketch out the integration method for the PDE system (18), (19), (20). The key idea is to assume that the potential vorticity at the current time instant, denoted as q_i(t_0), is already known. Then, in order to obtain the corresponding stream function, we have to invert equations (18) and (19). By applying ∇^2 to (18) and subtracting the sum of F_1 times (19) and F_2 times (18) from the result we arrive at:

(∇^2 - (F_1 + F_2)) ∇^2 ψ_1 = ∇^2 q_1 - F_2 q_1 - F_1 q_2 + (F_1 + F_2) β y + F_1 R_s.    (25)

Equation (25) is a nonhomogeneous Helmholtz equation with negative constant parameter -(F_1 + F_2), where ∇^2 ψ_1 is the unknown [29]. The boundary conditions for this auxiliary problem are inherited from the boundary conditions of the QG-model. A general method for solving (25) is discussed in [11]. Alternatively, it is possible to apply a finite-difference scheme, which turns out to be stable when the parameter of the Helmholtz equation is a negative constant. In our implementation we use the second approach.
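As a sanity check of the finite-difference approach, here is a one-dimensional analogue of (25): solving u'' - c u = f with c > 0 and homogeneous Dirichlet conditions by second-order central differences (a toy stand-in for the 2D solver; the function name, the value of c, and the manufactured solution are ours):

```python
import numpy as np

def helmholtz_1d(f, c, h):
    """Solve u'' - c*u = f on the interior nodes of a uniform grid,
    with u = 0 at both boundaries, by second-order central differences."""
    n = len(f)
    Amat = (np.diag(np.full(n, -2.0 / h**2 - c))
            + np.diag(np.full(n - 1, 1.0 / h**2), 1)
            + np.diag(np.full(n - 1, 1.0 / h**2), -1))
    return np.linalg.solve(Amat, f)
```

The negative shift -c makes the system strictly diagonally dominant, which is one way to see the stability noted above.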
Assuming that (25) is solved, we can compute the value of ψ_1 by solving the corresponding Poisson problem [17]. The solution is obtained by a finite-difference method [24]. The value of ψ_2 can then be deduced from (18) and (19) by substituting the obtained value of ψ_1. Therefore, given the potential vorticity at the current time instant, we can compute the corresponding stream function in both layers. Finally, in order to account for the conservation law (20), we define the propagation of potential vorticity over time by a semi-Lagrangian method [31].
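The semi-Lagrangian idea in one dimension: the new value at a grid node is the old field interpolated at the departure point of the characteristic through that node. A minimal periodic sketch with a uniform velocity (function names and values are ours):

```python
import numpy as np

def semi_lagrangian_step(q, u, dt, x, L):
    """Advect q by one time step: trace each node back along the flow
    and interpolate the old field at the departure point (periodic domain)."""
    x_dep = (x - u * dt) % L          # departure points of the characteristics
    return np.interp(x_dep, x, q, period=L)
```

Because values are looked up along characteristics rather than differenced in place, the step remains stable even when u dt exceeds the grid spacing.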
Thereby, the discrete numerical solver for the PDE system (18), (19) and (20) can be composed of three stages, where the first employs semi-Lagrangian advection to propagate the system in time, and the rest include a double application of finite differences to define the result of the propagation over the spatial discretization grid.

4. Numerical experiments. In this section we present data assimilation benchmarks made to assess the competitive performance of the SA-EKF algorithm introduced in the previous sections.
The numerical experiments were carried out using the two-layer Quasi-Geostrophic model with several benchmarks. In our benchmarks we consider all values simulated by the QG-model on its two-level spatial grid stacked into a long vector, which we hereinafter refer to as the state vector of the model. The natural vector space corresponding to the state vector is referred to as the state space of the model.
Our benchmarks were run with increasing dimension of the state space. The intention was to assess the performance of the L-BFGS EKF and of its stabilized version SA-EKF with respect to the dimension of the problem. In addition, each benchmark itself included two parallel executions of the QG-model made at different resolutions. The simulation executed at the higher resolution, hereafter referred to as the truth run, was used to generate observations of the "nature". The lower-resolution simulation, referred to as the biased run, was employed as the transition operator M_k in the Kalman filters.
In order to introduce an additional bias the truth and the biased runs were executed with different values for the layer depths D 1 and D 2 . In addition, the state of the truth run was not fully observed having only a few of its components continuously monitored. The locations of the observations were selected randomly.
Due to the difference in resolutions between the truth and the biased runs, the spatial locations of the observation sources and the nodes of the discrete grid of the values simulated by the biased run did not necessarily line up. This was circumvented by interpolating the state of the biased run onto the grid of the truth run using bilinear interpolation. Hence, the observation operator $H_k$ was composed of two stages: the first implemented the interpolation of the state, and the second performed the projection of the interpolated state onto the subspace of the observations. Since the observed locations were selected once and remained unchanged during each assimilation run, the operator $H_k$ was independent of k. Consequently, in the experiments $H_k$ was linear, and its own implementation was thus used as the TL-code. The corresponding AD-code was programmed by applying the chain rule to the AD-codes of the bilinear interpolator and of the orthogonal projector.
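To make the two-stage structure of $H_k$ concrete, here is a minimal sketch of a linear observation operator built from bilinear interpolation followed by evaluation at the observed locations. The function names are hypothetical and the code is an illustration, not the implementation used in the experiments.

```python
import numpy as np

def bilinear_interpolate(field, xs, ys):
    """Bilinearly interpolate a 2-D gridded field at fractional
    grid coordinates (xs, ys) given in units of the grid spacing."""
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    dx, dy = xs - x0, ys - y0
    return ((1 - dx) * (1 - dy) * field[x0, y0]
            + dx * (1 - dy) * field[x0 + 1, y0]
            + (1 - dx) * dy * field[x0, y0 + 1]
            + dx * dy * field[x0 + 1, y0 + 1])

def observation_operator(field, obs_x, obs_y):
    """H = projection onto the observed locations composed with
    bilinear interpolation. The map is linear in the state, so its
    own implementation can serve as the tangent-linear code."""
    return bilinear_interpolate(field, obs_x, obs_y)
```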
The TL and AD codes of the evolution were derived by a careful examination of the numerical solver implementing the transition operator $M_k$. The time step of the solver was set to 1 hour (this is the discretization time step used in the integration of the PDEs (18), (19), (20)). The step of data assimilation was 6 hours. This means that a single state transition by $M_k$ comprised 6 repeated evaluations of the underlying numerical solver. Therefore, the corresponding TL- and AD-codes utilized information from a nonlinear model trajectory, tracing it forwards and backwards respectively and applying the chain rule. It should be noted that the prediction operator $M_k$ was fixed, i.e. independent of k, in the simulations. Further details regarding the configuration of the data assimilation and the settings used by the model are described in the following subsections.
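The chain-rule construction of the TL- and AD-codes over repeated solver steps can be illustrated generically: the tangent linear multiplies the per-step Jacobians (evaluated along the stored nonlinear trajectory) in forward order, while the adjoint multiplies their transposes in reverse order. The names below are hypothetical; this is a sketch, not the model code.

```python
import numpy as np

def tangent_linear(jacobians, dx):
    """Propagate a perturbation dx through a chain of model steps
    by multiplying the per-step Jacobians in forward order."""
    for J in jacobians:
        dx = J @ dx
    return dx

def adjoint(jacobians, dy):
    """Adjoint of the same chain: multiply by the transposed
    Jacobians in reverse (trajectory-backwards) order."""
    for J in reversed(jacobians):
        dy = J.T @ dy
    return dy
```

The two routines are consistent in the sense of the adjoint identity $\langle M' dx, dy\rangle = \langle dx, (M')^T dy\rangle$, a standard correctness check for TL/AD code pairs.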
4.1. The covariance matrices. The prediction and observation uncertainties were modelled by normally distributed random variables $\epsilon_k$, $\eta_{k+1}$ (see equations (1) and (2)) with covariance matrices defined as follows:

$$ C_{\epsilon_k} = \sigma^2 \begin{pmatrix} I & \alpha I \\ \alpha I & I \end{pmatrix}, \qquad C_{\eta_{k+1}} = \xi^2 I, $$

where $\sigma^2$ is the variance of the model uncertainty, $\alpha$ is the correlation coefficient between the layers, $\xi^2$ is the variance of the observation measurement error, and I is the identity matrix of the dimension dictated by the prediction and observation models. We assigned the following values to these parameters: $\sigma = 0.1$, $\alpha = 0.8$, $\xi = 0.5$. The values used for $\alpha$ and $\sigma$ were estimated from model simulations. The value of $\xi^2$ was selected in accordance with the perturbation noise applied to the observation data.
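The covariances can be assembled as follows. The two-layer block structure (variance $\sigma^2$ per layer, correlation $\alpha$ between the corresponding components of the two layers) is our reading of the description above, so treat this as a hedged sketch rather than the exact matrices used in the benchmarks.

```python
import numpy as np

def prediction_covariance(n_layer, sigma=0.1, alpha=0.8):
    """Model-error covariance for a two-layer state (assumed block
    structure): variance sigma^2 on each layer and correlation
    alpha between corresponding components of the two layers."""
    I = np.eye(n_layer)
    return sigma**2 * np.block([[I, alpha * I], [alpha * I, I]])

def observation_covariance(m, xi=0.5):
    """Observation-error covariance: iid noise of variance xi^2."""
    return xi**2 * np.eye(m)
```

Note that $|\alpha| < 1$ keeps the prediction covariance positive definite (its eigenvalues are $\sigma^2(1 \pm \alpha)$).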

4.2. The settings of the QG-model. The configuration of the QG-model used in our benchmarks was defined in accordance with the experiments performed by ECMWF [21]. The settings described here are aimed at creating baroclinic instability in the model dynamics, a quasi-periodic regime characterized by seemingly "stormy" behaviour. This model regime can be used to efficiently analyse the performance of data assimilation methods, as its trajectories reside within a bounded region but do not converge towards equilibria. The model was simulated in a 12000km-by-6000km rectangular domain. The bottom surface orography was modelled by a Gaussian hill of 2000m height and 1000km width located at coordinates (1200km, 9000km) counting from the lower-left corner of the simulation domain. The nondimensionalization of the model was defined by the main length scale L = 1000km and the main velocity scale U = 10m/s. The rest of the model parameters introduced in (21) were assigned the following values: $f_0 = 10^{-4}$, $\Delta\theta/\theta = 0.1$, $\beta_0 = 1.5 \cdot 10^{-11}$. The values of the model layer depths $D_1$ and $D_2$ were different for the truth and the biased runs to introduce an extra bias into the prediction. These settings are described in detail in the next subsection.
4.3. The assimilation runs. As pointed out earlier, we ran two instances of the QG-model in parallel, where the first instance has the higher resolution and simulates "the nature", and the second instance, with lower resolution, models the prediction operator employed by the filters. The layer depths in the truth run were defined by $D_1$ = 6000m for the top layer and $D_2$ = 4000m for the bottom layer. The biased runs were executed using the layer depths of 5500m and 4500m, respectively. Prior to the start of the data assimilation, the truth and the biased runs were initialized with the same initial state (interpolated to the corresponding resolution of the grid) and were both executed for 10 days of the model time. As illustrated by figure III, this caused a notable divergence between the model states. The data assimilation began after the end of this 10 day spin-up period. The state of the biased run at the end of this period was used as the first initial estimate $x^{est}_0$. The uncertainty of this estimate was modeled "naively" by the identity covariance, i.e. $C^{est}_0 = I$. The goal of the assessment was to test the ability of the assimilation methods to recover from the error in the initial estimate, account for the bias in the transition operator, and accurately predict the state at the next time instant. As criteria for the quality of assimilation we used two concepts: the root mean square error of the estimates and the forecast skill (precise definitions are given below). Each data assimilation experiment comprised 500 assimilation rounds, which is equivalent to 125 days of the model time. The forecast skills were calculated skipping the time period during which the assimilation had not yet recovered from the initial gross error, so that the first steps of assimilation were not counted in the skill. For all filters, a total of 150 assimilation rounds was assumed to be a "satisfactory" recovery time.
Figure II contains example root mean square errors of the estimates computed by the filters during the first 30 assimilation rounds. These curves demonstrate the elimination of the error present in the initial estimate. Hereinafter, this process will sometimes be referred to as filter convergence.
In our experiments the model state vector was defined by the discretized stream functions $\psi_1$ and $\psi_2$ (see equations (18), (19)). The Kalman filters were employed to correct predictions made by the biased run using the observation data obtained from the truth run. The quality of the resulting SA-EKF estimates was assessed by the root mean square error (RMS error) defined as follows:

$$ (26) \qquad \mathrm{RMS} = \frac{\| x^{est} - x^{truth} \|}{\sqrt{N}}. $$

Here $x^{est}$ is the estimate of the true state $x^{truth}$ at a given time instant, N is the dimension of these vectors, and $\|\cdot\|$ is the Euclidean norm. Because in our benchmarks the truth and the biased runs are executed at different resolutions, the vectors $x^{truth}$ and $x^{est}$ have different dimensions. Hence, in order to give meaning to the numerator in formula (26), the estimate $x^{est}$ is interpolated onto the spatial grid of the corresponding truth run. An additional study of performance was done by means of the forecast skills. At each time instant, the estimates were propagated by applying the transition operator $M_k$. The resulting forecast was compared with the corresponding "real" state of the truth run using the RMS error metric. Hence, the resulting average over all assimilation rounds describes the accuracy of the forecast for a given prediction range. The forecast skills were compared against a standard baseline called the "tourist brochure" (climatology), which is defined as the RMS error of the mean model state.
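Formula (26) translates directly into code. A minimal sketch (hypothetical function name), assuming the estimate has already been interpolated onto the truth grid so that both vectors have the same dimension N:

```python
import numpy as np

def rms_error(x_est, x_truth):
    """Root mean square error ||x_est - x_truth|| / sqrt(N) as in
    formula (26); both vectors must share the same dimension N."""
    x_est = np.asarray(x_est, dtype=float)
    x_truth = np.asarray(x_truth, dtype=float)
    return np.linalg.norm(x_est - x_truth) / np.sqrt(x_truth.size)
```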
In all discussed benchmarks, the biased run had a resolution that allowed explicit application of the exact EKF formulas. Therefore, the EKF estimates were used as a reference line for the performance assessment of the approximative L-BFGS-EKF and SA-EKF. Our numerical experiment included three benchmarks with increasing resolution of the biased run (and therefore, an even higher resolution of the corresponding truth run). Hereinafter we will often refer to these benchmarks by number (i.e. benchmark 1, benchmark 2, and benchmark 3 in the order of increasing resolution). Table I summarizes the RMS errors of the estimates. Entries marked by "-" correspond to experiments where the filter did not converge due to some of the eigenvalues of the estimate covariance being negative. For instance, the estimates diverged from the truth when L-BFGS-EKF was run with a BFGS storage capacity of 5 iterations. In addition, numerical values marked by the symbol "*" indicate experiments that used inflation in the forecast covariance matrix $C^p_k$, i.e. the forecast covariance was scaled by a constant inflation factor. This trick is heuristic and was employed to prevent spurious growth in the magnitude of the estimated eigenvalues, which happens in the case of high uncertainty modeled by coarse approximations. The accuracy of the estimates might be further improved with a more judicious choice of the inflation factor [1]. Nevertheless, in our benchmarks the given choice was enough for the filter to recover from the strongly biased first initial value. It should also be emphasized that we employed covariance inflation only in those experiments which did not converge otherwise. In the experiments marked by "-", covariance inflation with the suggested inflation factor did not prevent the divergence of the filter. Table III contains results obtained from the same experiments in the form of the forecast length.
Here the forecast length is the time period between the beginning of the forecast and the intersection of the forecast skill curve with the "tourist brochure" baseline. The time periods in Table III are rounded to hours of model time.
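The forecast length can be extracted from a discretely sampled forecast-skill curve as sketched below. This is our own illustration of the definition (hypothetical names), assuming linear interpolation between lead-time samples before rounding to whole hours.

```python
import numpy as np

def forecast_length(lead_hours, skill_rms, climatology_rms):
    """Lead time at which the forecast-skill RMS curve first crosses
    the climatology ("tourist brochure") baseline, linearly
    interpolated between samples and rounded to whole hours."""
    skill = np.asarray(skill_rms, dtype=float)
    above = np.nonzero(skill >= climatology_rms)[0]
    if above.size == 0:
        return lead_hours[-1]          # never crosses the baseline
    i = above[0]
    if i == 0:
        return lead_hours[0]
    # linear interpolation between samples i-1 and i
    t0, t1 = lead_hours[i - 1], lead_hours[i]
    s0, s1 = skill[i - 1], skill[i]
    return round(t0 + (climatology_rms - s0) / (s1 - s0) * (t1 - t0))
```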

5. Conclusion. In this paper we propose a stabilizing correction for the L-BFGS approximation of the extended Kalman filter previously introduced in the literature. We call our reformulation of the algorithm SA-EKF to emphasize the difference. The suggested correction ensures that the covariance matrix of a filter estimate remains "proper" in the sense that all of its eigenvalues are non-negative.
Earlier results in [2] demonstrated that even the non-stabilized L-BFGS approximation converges for 'easy' non-chaotic large scale models, such as the heat equation, and for smaller chaotic systems, such as Lorenz95 with 40 state dimensions. However, when the state space dimension of a chaotic system increases, the L-BFGS EKF tends to require larger BFGS storage in order to converge, or may fail due to accumulating errors. It was shown that SA-EKF performs in a robust way and is able to achieve satisfactory approximation results with a considerably smaller number of BFGS iterations.

6. Appendix. The appendix contains details about the BFGS method.
Here we prove the BFGS update formula and discuss its use in Quasi-Newton unconstrained optimization algorithms.
The L-BFGS unconstrained optimizer is a Quasi-Newton method for quadratic minimization (it can nevertheless be extended to non-quadratic cost functions). As such, it is based on an approximation of the inverse Hessian of the cost function. The approximation employed by the L-BFGS method is based on the BFGS update, a rank-two update formula for matrix approximation. Below we present a sketch of the derivation of the BFGS formula based on the rank-one 'good Broyden' update.
6.1. Step 1. Good Broyden update. We begin by considering the following task: given a pair of vectors s, y, where s is non-zero, and a square matrix A, find a matrix D such that (A + D)s = y and D has minimal Frobenius norm. In other words, we need to find a solution of the optimization task

$$ (27) \qquad \arg\min_D \{ \|D\|_{Fr} : (A + D)s = y \}. $$
Lemma 6.1. The task (27) is solved by $D = \frac{(y - As)s^T}{s^T s}$.

Proof. The solution of the task (27) can be constructed explicitly. Consider an arbitrary matrix D. The Frobenius norm of this matrix can be defined as $\|D\|_{Fr}^2 = \mathrm{tr}(D^T D)$, where $\mathrm{tr}(\cdot)$ denotes the trace, defined as the sum of the matrix diagonal elements. Hence, in accordance with the matrix spectral theorem, $\|D\|_{Fr}^2 = \mathrm{tr}(P L P^T)$, where P is the matrix composed of the column eigenvectors of $D^T D$ and L is the diagonal matrix of the corresponding eigenvalues. Since the trace is invariant to the choice of basis, we can re-express the Frobenius norm of the matrix D as $\|D\|_{Fr}^2 = \sum_i \lambda_i$, where $\lambda_i \ge 0$ is the i-th eigenvalue of $D^T D$. At the same time, we require $Ds = y - As = r$. Hence $s^T D^T D s = r^T r$. Applying the spectral theorem again, we obtain $\tilde{s}^T L \tilde{s} = r^T r$, where $\tilde{s} = P^T s$ is the representation of the vector s in the orthonormal basis of the eigenvectors in P. Therefore, the task (27) can be reformulated as the minimization problem

$$ (28) \qquad \min_{\lambda_i \ge 0} \sum_i \lambda_i $$

subject to the constraint

$$ (29) \qquad \sum_i \lambda_i \tilde{s}_i^2 = r^T r. $$
Next, we assume that $\tilde{s}_1^2 \ge \tilde{s}_i^2$ for all i, i.e. that $\tilde{s}_1^2$ is the largest among the terms $\tilde{s}_i^2$ (a legal assumption, since $\tilde{s}_1^2$ can always be made the largest by renumbering the eigenvalues $\lambda_i$). Hence, from the constraint equation (29) we obtain

$$ \lambda_1 = \frac{r^T r - \sum_{i \ge 2} \lambda_i \tilde{s}_i^2}{\tilde{s}_1^2}. $$

Substituting this expression for $\lambda_1$ into the minimization problem (28), we eliminate the constraint (29) and arrive at

$$ (30) \qquad \frac{r^T r}{\tilde{s}_1^2} + \sum_{i \ge 2} \lambda_i \Big( 1 - \frac{\tilde{s}_i^2}{\tilde{s}_1^2} \Big). $$
Obviously, the sum in (30) is non-negative, and since we want to select D so that it minimizes the task (27), we set $\lambda_i = 0$ for $i \ne 1$. Hence $\lambda_1 = \frac{r^T r}{\tilde{s}_1^2}$. However, $\tilde{s}_1^2$ is determined by the magnitude of the orthogonal projection of the vector s onto the first eigenvector of the matrix $D^T D$. Therefore, $\lambda_1$ is minimized by assigning the first eigenvector of $D^T D$ to $\frac{s}{\|s\|}$. Then $\tilde{s}_1^2 = \|s\|^2$ and therefore $D^T D = \frac{r^T r}{\|s\|^4} s s^T$, which leads to the solution $D = \frac{(y - As)s^T}{s^T s}$. This rank-one update is known as the good Broyden update.
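Lemma 6.1 is easy to check numerically. The sketch below (illustrative code, hypothetical name) forms the good Broyden correction and can be used to verify the secant constraint (A + D)s = y:

```python
import numpy as np

def good_broyden_update(A, s, y):
    """Minimum-Frobenius-norm rank-one correction D satisfying
    (A + D) s = y, namely D = (y - A s) s^T / (s^T s) (lemma 6.1)."""
    r = y - A @ s
    return np.outer(r, s) / (s @ s)
```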

6.2. Step 2. BFGS update formula. Our next goal is to strengthen the constraints of the problem (27) by requiring the updated matrix A + D to be symmetric and positive definite, assuming that the matrix A meets this requirement. However, there may be no solution satisfying the constraints. The following lemma gives a simple existence criterion.

Lemma 6.2. There exists a symmetric positive definite update A + D satisfying (A + D)s = y if and only if there exist a non-zero vector z and a non-singular matrix L such that y = Lz and $z = L^T s$.

Proof. Indeed, assume that A + D is a symmetric positive definite update which satisfies the constraint (A + D)s = y. Then we can apply the Cholesky decomposition, which yields $A + D = \Sigma \Sigma^T$. Recalling the constraint, we obtain $\Sigma \Sigma^T s = y$. Hence, setting $z = \Sigma^T s$ and $L = \Sigma$ finalizes the first part of the proof.
Suppose now that there are a non-zero vector z and a non-singular matrix L such that the lemma requirements are met. Then we can set $A + D = LL^T$. Indeed, $(A + D)s = LL^T s = Lz = y$, so the desired constraint is satisfied.
Next we derive the BFGS formula. In this paper we only provide a sketch of the proof, which, together with lemma 6.2, was suggested in [16]. We need to emphasize that the derivation of the BFGS update formula is done for the case when the matrix being updated is positive definite. However, earlier in this paper we employed the weaker assumption of non-negative definiteness. This weakening is possible due to the fact that the BFGS and L-BFGS optimization algorithms can be implemented even when the Hessian matrix of the cost function is singular [26], although the derivation of the BFGS update formula does not hold in this case. We start by considering a non-zero vector s and a positive definite matrix $\bar{B}$ such that $\bar{B}s = y$. Then lemma 6.2 implies that there are a non-zero vector z and a non-singular matrix L such that y = Lz and $z = L^T s$. Hence, $z^T z = s^T L L^{-1} y = s^T y > 0$.
Hereafter we assume $s^T y > 0$. Our goal is to construct $\bar{J}$ such that $\bar{B} = \bar{J}\bar{J}^T$ and $\bar{B}s = y$. Let us denote $z = \bar{J}^T s$ and assume that z is already known. Then we can employ lemma 6.1 to update the given matrix J:

$$ (31) \qquad \bar{J} = J + \frac{(y - Jz)z^T}{z^T z}. $$
The equality $z = \bar{J}^T s$ and expression (31) for $\bar{J}$ allow us to write an equation with respect to the vector z:

$$ z = \bar{J}^T s = J^T s + \frac{z\,(y - Jz)^T s}{z^T z}. $$

This implies that $z = \alpha J^T s$ for some scalar $\alpha$. Plugging this expression for z back into the latter equation, we obtain

$$ \alpha = 1 + \frac{y^T s - \alpha s^T B s}{\alpha s^T B s}. $$
Hence, $\alpha^2 = \frac{y^T s}{s^T B s}$, and since $y^T s > 0$ and B is positive definite, we can get a real value for $\alpha$ (it is not important whether the chosen value is positive or negative). Therefore $z = \sqrt{\frac{y^T s}{s^T B s}}\, J^T s$. Substitution of this expression for z into (31) yields the desired formula for $\bar{J}$.
The proven lemma is fundamental for the derivation of the classical form of the BFGS update, which is obtained by plugging the derived formula for $\bar{J}$ into $\bar{B} = \bar{J}\bar{J}^T$. Skipping the onerous calculations, the substitution yields

$$ (32) \qquad \bar{B} = B - \frac{B s s^T B}{s^T B s} + \frac{y y^T}{y^T s}. $$

In addition, the Sherman-Morrison-Woodbury matrix inversion lemma applied to (32) gives the BFGS update formula for the inverse matrix:

$$ (33) \qquad \bar{B}^{-1} = \Big(I - \frac{s y^T}{y^T s}\Big) B^{-1} \Big(I - \frac{y s^T}{y^T s}\Big) + \frac{s s^T}{y^T s}. $$

It is possible to prove that (33) defines a least change update for the positive definite $B^{-1}$ given $\bar{B}^{-1} y = s$ [14], [15]. More precisely, (33) provides the solution of the optimization problem

$$ \arg\min_{H} \{ \|H - B^{-1}\| : H = H^T,\ H y = s \}, $$

where B is assumed to be positive definite and the norm is a suitably weighted Frobenius norm. When applied to the approximation of the Newton method, the update formula (33) leads to the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. For the purposes of this paper, the BFGS method is limited to quadratic minimization. However, the results given here can be extended to the non-quadratic case [26].
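Both update formulas can be verified numerically; the sketch below implements (32) and (33) with hypothetical names, as an illustration rather than library code.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Direct BFGS update (32): preserves symmetry and positive
    definiteness when s^T y > 0, and satisfies B_new @ s = y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def bfgs_inverse_update(H, s, y):
    """Inverse BFGS update (33), obtained from (32) via the
    Sherman-Morrison-Woodbury formula; satisfies H_new @ y = s."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

A quick consistency check is that the two routines produce a matrix and its exact inverse, and that each satisfies its secant condition.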
Below we present a sketch of the BFGS algorithm. We adhere to the following notation: A and b are as defined in the optimization problem (4), $x_0$ is an n-by-1 vector providing the initial approximation of the cost function minimizer, $h_0$ is a scale factor of the inverse Hessian matrix approximation, which is initially assigned to $h_0 I$, N is the requested number of iterations, and $x_{min}$ is the obtained minimizer of the cost function f(x).
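Since the quadratic cost $f(x) = \frac{1}{2} x^T A x - b^T x$ admits an exact line search, the procedure can be sketched in a few lines. This is our own minimal illustration under the stated notation, not a reproduction of the authors' algorithm III, and it omits the limited-memory bookkeeping of L-BFGS.

```python
import numpy as np

def bfgs_quadratic(A, b, x0, h0=1.0, N=50):
    """BFGS minimization of f(x) = 0.5 x^T A x - b^T x for symmetric
    positive definite A. H approximates the inverse Hessian A^{-1},
    starting from h0 * I and refreshed with update (33)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = h0 * np.eye(n)
    g = A @ x - b                          # gradient of f at x
    for _ in range(N):
        if np.linalg.norm(g) < 1e-12:      # already at the minimizer
            break
        p = -H @ g                         # quasi-Newton search direction
        alpha = -(g @ p) / (p @ A @ p)     # exact line search (quadratic f)
        s = alpha * p
        x = x + s
        g_new = A @ x - b
        y = g_new - g
        if y @ s > 0:                      # curvature condition s^T y > 0
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        g = g_new
    return x
```

For a quadratic cost the minimizer satisfies $A x_{min} = b$, which gives a direct correctness check.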
The vector pairs $(s_i, y_i)$ stored by the algorithm, together with formula (33), yield a low-memory representation of the operator $A^{-1}$. Leveraging formula (32), the same stored pairs can be employed to define a low-memory representation of the forward matrix A.
To wrap up this section, we should note that the sketch given by algorithm III cannot be considered a recipe for an efficient implementation of BFGS or L-BFGS. In addition, we do not discuss the question of choosing the initial Hessian scale $h_0$ or the cases of a singular Hessian matrix. A more detailed analysis of these algorithms is available in [25].

7. Acknowledgements. The contribution discussed in the present paper has been financially supported by the Finnish Academy Project 134937 and the Centre of Excellence in Inverse Problems Research (project number 250 215).