ROBUST PARAMETER ESTIMATION OF CHAOTIC SYSTEMS

Abstract. Reliable estimation of the parameters of chaotic dynamical systems is a long-standing problem important in numerous applications. We present a robust method for parameter estimation and uncertainty quantification that requires neither knowledge of the initial values of the system nor good guesses for the unknown model parameters. The method uses a new distance concept recently introduced to characterize the variability of chaotic dynamical systems. We apply it to cases where more traditional methods, such as those based on state space filtering, are no longer applicable. Indeed, the approach combines concepts from chaos theory, optimization and statistics in a way that enables solving problems considered 'intractable and unsolved' in prior literature. We illustrate the results with a large number of chaotic test cases, and extend the method in ways that increase the accuracy of the estimation results.

1. Introduction. A number of recently published papers discuss the difficulties of parameter estimation introduced by the chaoticity of the models; see, e.g., [35], [37], [26], [29]. The situation is usually treated as a 'black box' least-squares fitting problem. However, this approach is not appropriate for such systems over long integration intervals, since a least-squares solution does not even exist in the sense in which it exists for deterministic models. A new integration of the same system with the very same parameter values and initial values, but using a different solver or slightly different solver tolerances, typically leads to a totally different trajectory after an initial time interval over which the trajectory is predictable.
The problem of chaoticity is trivially avoided if, in addition to accurately known initial values, enough data are available on a time interval short enough for the system to remain predictable. However, this is hardly the situation with higher dimensional systems, such as turbulent flows or weather models. But the idea can be employed to construct likelihoods based on Kalman filtering. In particular, one divides the total integration time interval into subintervals, often called assimilation windows, which are short enough for performing deterministic (predictable) simulations. The construction of a trajectory using the measurement data is then followed by the adjustment of the state vector at the start of each assimilation window. This process 'integrates out' the state vector, and can be used to construct an expression that depends only on the data and the parameters, i.e., the filter likelihood. While standard for linear time series and stochastic differential equations [30], [7], this approach leads to tuning issues for chaotic systems; see [13] and [14]. We note that the use of predictable assimilation intervals is the essence of weather prediction algorithms, although basic Kalman filtering methods are not applicable due to memory issues in high dimensions, and 4D-VAR methods are employed instead; see [17].
Another obvious approach to improve the estimation of the chaotic system parameters is to use averaging. In [15] we constructed a cost function based on zonal and temporal averages, simulated by the ECHAM5 climate model. While the stochasticity of the cost function was damped enough to get usual Markov chain Monte Carlo (MCMC) sampling technically working, this likelihood did not constrain the model parameters: naive summary statistics do not characterize the specific geometry of an underlying chaotic attractor in a meaningful way.
We show here that it is possible to construct a likelihood free of the above-mentioned pitfalls, supposing that long enough time series data are available. In [12] a distance concept was introduced for chaotic trajectories with the goal of parameter sensitivity analysis. The purpose was to characterize the 'internal' variability of a given chaotic system with known system parameter values, in order to distinguish trajectories that deviate from this variability. The focus of the current paper is on providing a robust solution to the chaotic system estimation problem by using this distance concept. In particular, we provide: (i) the estimation of the chaotic system parameters from noisy data, including cases where the initial state values are unknown and the initial model parameters lie far away from those producing the attracting chaotic manifold (we note that in [12] the parameter values of a chaotic system were assumed to be known, as the task was only to estimate the corresponding variability regions for these parameters); (ii) extensions of the distance concept, resulting in several improvements: more accurate confidence intervals for the estimated parameter values, and estimation of the dynamics, including cases where the system involves vastly different time scales, or where the parameter estimation needs to be performed simultaneously for coupled chaotic and deterministic parts of a system. With several examples of chaotic systems we demonstrate how this likelihood construction allows for robust parameter estimation, together with subsequent MCMC sampling of the parameter posteriors. We note that the proposed approach works even if the time intervals between observations are too long for the filtering based approaches to be applicable.
Indeed, while the use of summary statistics to create cost functions or likelihoods is not new as such, see, e.g., [1], [22], [36], [15], the present approach combines concepts from chaos theory, optimization and statistics in a novel way that allows us to solve problems considered as 'intractable and unsolved' in prior literature; see, e.g., [29].
The rest of the paper is organized as follows. In Methods we discuss all the steps needed for performing uncertainty quantification of the parameters of chaotic dynamical systems. In particular, we discuss how the correlation integral concept can be modified to create an empirical likelihood from given time series data. In Extended feature vectors we extend the approach in a way that allows one to characterize the dynamics of a chaotic system. For systems exhibiting 'slow' chaotic dynamics and 'fast' transitions to a chaotic manifold we show how to determine whether a point belonging to a trajectory lies in the vicinity of a chaotic attractor or away from it. These results are important for identifying the portions of the trajectories on which the data must be collected to estimate the coefficient responsible for the 'fast' dynamics. Also, the algorithms discussed below require randomization of the initial conditions to produce a set of trajectories belonging to an attractor; we indicate how to check whether the initial conditions are located on or in a close vicinity of an attractor. The Results section of the numerical experiments starts with the Lorenz 63 system. We demonstrate the application of the approach for estimating the parameters from sparse data, with poor initial guesses for the parameter values and unknown initial values of the state vector. In Lorenz 63 with dynamics included, we present the results and discuss guidelines for implementation. In Further numerical experiments we apply the method to the analysis of numerous well-known chaotic systems. We show how the proposed extension greatly improves the parameter identification, especially in more complex cases. A brief discussion summarizing the results and addressing the next steps of the analysis and possible applications is included in Discussion and conclusions.
The Appendix contains further details of the optimization method used, the derivation of some formulas discussed in the different time scales part of the text and the set up of all the numerical experiments.

2. Methods.
2.1. Parameter estimation. The procedure of parameter estimation in a standard, non-chaotic setting consists of three steps: first, constructing the cost function as a statistical likelihood, based on known or assumed noise statistics of the measurements; next, finding the maximum likelihood or MAP (maximum a posteriori) parameter estimate by some numerical optimization routine; finally, finding the full posterior distribution of the unknown parameters, typically by the use of sampling methods such as the various MCMC (Markov chain Monte Carlo) algorithms.
The main difficulty of estimating parameter values of a chaotic dynamical system is related to the fact that, indeed, no unique numerical solutions exist: even if the model parameter values and initial state values are fixed, the trajectories produced by numerical integrations with different numerical solvers, or slightly different solver tolerance choices, differ significantly on sufficiently long integration intervals. So, the standard construction of a likelihood function is not available. We will present here an alternative that enables parameter estimation in full analogy with the standard approach. However, the special nature of the likelihood construction imposes some requirements on the algorithms to be used.
We reiterate the basic ideas of the approach used in [12] and introduce the modifications required for the task of parameter estimation.

2.2. Correlation integral likelihood.
Denote by s = s(t, θ, x) the state vector of a dynamical system, depending on time t, parameters θ and other inputs x such as, e.g., the initial values. The measurements of the system at the time points t_1, ..., t_N are denoted by s_i = s(t_i, θ, x), i = 1, ..., N. We consider two different trajectories s = s(t, θ, x) and s̃ = s(t, θ̃, x̃), obtained for different parameter and input sets, (θ, x) and (θ̃, x̃), and evaluated at the N time points t = {t_1, ..., t_N}. For R > 0, the modified correlation sum is defined as

(1) C(R, θ, x, θ̃, x̃) = (1/N²) #(|s_i − s̃_j| < R),

where #(|s_i − s̃_j| < R) denotes the number of pairs of points of the two trajectories lying within a sphere of radius R from each other. In the case θ̃ = θ and x̃ = x the formula reduces to the well-known definition of the correlation sum. However, we are not interested in the small-scale limit R → 0, as in the classical definition of the correlation dimension, but use the expression (1) to characterize the distance between different trajectories at all relevant scales R. More specifically, we can define a likelihood for a given dynamical system with a fixed parameter value θ as follows. We fix a set of radii R_k, k = 1, ..., M, and consider the vector y of dimension M with components from (1) evaluated with θ̃ = θ:

(2) y_k = C(R_k, θ, x, θ, x̃), k = 1, ..., M.

For randomized values x̃ ≠ x the vector y is stochastic, with components that consist of averages, so by the Central Limit Theorem y can be expected to be Gaussian. Indeed, (2) gives the empirical cumulative distribution function (Ecdf) of the respective set of distances, with the values R_k used as bins. The basic form of Donsker's theorem states that empirical distribution functions asymptotically tend to a Brownian bridge. In a more general setting, close to the one employed here, the normality was established by Borovkova et al. [2]. So the expression (2) provides a summary statistic that maps samples from a chaotic attractor into a Gaussian feature vector.
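As a concrete illustration, the modified correlation sum (1) and the feature vector (2) can be computed as follows (a minimal Python sketch; the function name and vectorization choices are ours):

```python
import numpy as np

def correlation_sum(s, s_tilde, radii):
    """Modified correlation sum (1): for each radius R, the fraction of
    point pairs (s_i, s~_j) from the two trajectories lying within
    distance R of each other, i.e. #(|s_i - s~_j| < R) / N^2."""
    # Pairwise Euclidean distances between the two N-point trajectories.
    d = np.linalg.norm(s[:, None, :] - s_tilde[None, :, :], axis=-1)
    # Evaluating at a decreasing (or any) set of radii gives the
    # Ecdf-type feature vector (2).
    return np.array([np.mean(d < R) for R in radii])
```

Evaluating this for a fixed trajectory against trajectories with randomized inputs x̃ yields realizations of the stochastic feature vector y.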
In practice, one can compute a set of Ecdf vectors of the distances (2) from a training set obtained with a sufficient number of pairs x̃ ≠ x. By calculating the mean ȳ and covariance Σ of the y vectors, the Correlation integral likelihood (CIL) is obtained as the expression exp(−(1/2)(y − ȳ)^T Σ^{−1} (y − ȳ)). Note that a similar synthetic likelihood construction was used in [36] with different criteria.
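The CIL construction from a training set of feature vectors can be sketched as follows (illustrative Python; function names are ours):

```python
import numpy as np

def fit_cil(train_Y):
    """Fit the Gaussian CIL from a training set of feature vectors
    (one vector per row of train_Y): mean ybar and covariance Sigma."""
    ybar = train_Y.mean(axis=0)
    Sigma = np.cov(train_Y, rowvar=False)
    return ybar, Sigma

def cil_loglike(y, ybar, Sigma):
    """Log of exp(-(1/2) (y - ybar)^T Sigma^{-1} (y - ybar))."""
    r = y - ybar
    return -0.5 * r @ np.linalg.solve(Sigma, r)
```

The negative log-likelihood then serves directly as the (stochastic) cost function for the optimization and MCMC steps discussed below.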
Next, let us discuss the construction of a similar likelihood function for the task of estimating unknown parameter values from data. Instead of creating the training set by simulations with perturbed values of x, we resort to subsampling of the measurements. Assume that we have a time series of N_o measurements, large enough to cover all the characteristic features of the underlying chaotic dynamics. We create a training set for the CIL by sampling representative subsets of length N of these data. The approach requires that all trajectories used in (2) contain samples that cover the underlying attractor in a sufficient way. For this we must assume that N_o is large enough that subsamples of length N with this property can be constructed. Systems with rarely occurring events might yield sample sets that do not capture all such events. While this is not a serious issue in any of the numerous cases considered in the present work, it may arise in more complicated situations. This question will be discussed in more detail elsewhere.
The bin values are given by R_i = R_0 b^{−i}, i = 1, ..., M. The 'tuning' parameters of our method, R_0, b and M, can be selected in a straightforward manner. R_0 is set to slightly exceed the maximum of the distances computed by the recipe described above. The smallest radius R_M = R_0 b^{−M} should be large enough that, for all the feature vectors in the training set, the spheres with radius R_M contain neighbouring points. The remaining 'free' parameter is the number of bins M used to create the empirical cumulative distribution functions; the choice of M involves the usual trade-off present in constructing histograms or cumulative distribution functions. Once M and R_M are selected, the base value b can be found by solving R_M = R_0 b^{−M}, which gives b = (R_0/R_M)^{1/M}. The normality of the ensuing distribution can be numerically checked using the χ² distribution test; see the Results section below. Note that the averaging process used in the construction of the likelihood diminishes the variability of the likelihood function to the extent that optimization and MCMC sampling typically perform without major problems. However, even if the chaotic variability is 'tamed', the evaluation of the cost function is still stochastic, and appropriate algorithms are needed for performing the maximum likelihood optimization.
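The bin selection recipe can be sketched as follows (illustrative Python; the percentile rule used here to pick R_M is our heuristic stand-in for the requirement that the smallest spheres still contain neighbouring points):

```python
import numpy as np

def select_bins(distances, M=10, margin=1.05):
    """Choose R0 slightly above the maximum observed distance, pick R_M
    small but nonempty (heuristic: 1% of pairs fall within R_M), then
    solve R_M = R0 * b**(-M) for the base b and return the bin radii
    R_i = R0 * b**(-i), i = 1..M."""
    R0 = margin * distances.max()
    RM = np.percentile(distances, 1)        # heuristic choice of R_M
    b = (R0 / RM) ** (1.0 / M)
    radii = R0 * b ** (-np.arange(1, M + 1))
    return radii, b
```

Given the pooled distances of the training set, this produces the log-spaced bins at which the Ecdf feature vectors are evaluated.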
2.3. Optimization: Population methods. As the Gaussian likelihood is constructed, we can search for its maximum likelihood point, and perform the subsequent MCMC runs for estimating the parameter posterior distributions.
This implies that the optimization method used to find the maximum likelihood point should be robust with respect to non-smoothness. Moreover, as we may start the parameter estimation with initial guesses quite far away from the optimal values, and perform each model integration with randomized initial state values, the optimizer should have robust convergence properties. For these reasons we use here a variant of the Differential Evolution (DE) algorithm, a population-based optimization method. Differential Evolution belongs to the class of Evolutionary Algorithms (EAs) and was proposed in [33]; see also [27]. It consists of four main steps, initialization, mutation, crossover and selection, which are applied sequentially to an ensemble or 'population' of parameter values until certain criteria are met. The name DE comes from the particular way in which mutation is performed: new candidates are obtained by adding scaled differences of vectors to other vectors.
DE was initially designed for deterministic problems, but slight modifications to the original formulation enable us to use it as a robust optimizer for stochastic problems as well [31]. An issue in optimizing a stochastic cost function is the risk of accepting a falsely optimal value simply due to the noise: the larger the noise, the more likely the iteration is to 'freeze' at a non-optimal parameter value. The risk is diminished by the use of a population of parameters. Nevertheless, a common pitfall of population methods is the 'shrinkage' of the parameter values to a few points or a single point. Here we avoid this by a recalculation step: the cost function values are updated for all the parameter values in the current population. To save CPU time, however, this may be done only at selected iteration steps. Also, we note that with a stochastic cost function we do not even aim at convergence to a single maximum likelihood point, but instead want the values to fluctuate around the 'optimal' value. In the evolutionary language, this means preserving diversity in the population; indeed, the recalculation step ensures this diversity. While the optimization result is heuristic, the final parameter population values turn out to provide a very effective initial proposal distribution for a rigorous MCMC sampling; see the discussion below.
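The DE variant described above, including the recalculation step, can be sketched as follows (a minimal Python implementation of the standard DE/rand/1/bin scheme; parameter defaults are illustrative, not the values used in the experiments):

```python
import numpy as np

def de_minimize(cost, bounds, n_pop=32, n_gen=200, F=0.8, CR=0.9,
                recalc_every=10, rng=None):
    """DE/rand/1/bin with a recalculation step: every `recalc_every`
    generations the (possibly stochastic) cost is re-evaluated for the
    whole population, so no member 'freezes' at a lucky noise value."""
    rng = np.random.default_rng(rng)
    lo, hi = np.array(bounds, dtype=float).T
    d = len(lo)
    pop = lo + rng.random((n_pop, d)) * (hi - lo)      # initialization
    costs = np.array([cost(p) for p in pop])
    for gen in range(n_gen):
        if gen % recalc_every == 0:                    # recalculation step
            costs = np.array([cost(p) for p in pop])
        for i in range(n_pop):
            a, b, c = pop[rng.choice(n_pop, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)  # mutation
            cross = rng.random(d) < CR                 # binomial crossover
            cross[rng.integers(d)] = True              # at least one gene
            trial = np.where(cross, mutant, pop[i])
            ct = cost(trial)
            if ct <= costs[i]:                         # selection
                pop[i], costs[i] = trial, ct
    return pop, costs
```

For a stochastic CIL cost the returned population fluctuates around the optimum, which is exactly the diversity exploited below to seed the MCMC sampler.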
Another crucial reason to use DE here is the ease of parallelization, as each population member can be run independently from others, and synchronization is only needed at the selection step. This is beneficial, especially, in the case of heavy CPU models.
The DE algorithm can be implemented in various ways. Except for the recalculation step we follow one of the standard variants. All the details are given in the Appendix 1.
2.4. Uncertainty quantification: MCMC. After the convergence of the DE optimization we can study the parameter distributions by MCMC. Note that the sampling of a stochastic likelihood function can be interpreted as a pseudomarginal sampling, see [12].
In the examples we will see that, as an additional benefit of the diversity preservation, we may use the final part of the history of the optimization to create an effective initial proposal distribution for the adaptive MCMC methods (the AM or DRAM algorithms, see [10], [11]) used in the posterior sampling step. The MCMC process starts with the mean value of the final generation of DE optimization and the proposal distribution selected as the Gaussian distribution with the covariance calculated from several previous generations. Such a choice of the proposal distribution provides an effective acceptance rate from the beginning, and supports subsequent adaptation of the proposal distribution. Indeed, the role of the heuristic DE algorithm is to find good initial values for the rigorous posterior estimation by MCMC methods.
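The construction of the initial MCMC proposal from the final DE generations can be sketched as follows (illustrative Python; names are ours):

```python
import numpy as np

def initial_proposal(de_history, n_last=5):
    """Build the initial Gaussian proposal for adaptive Metropolis
    (AM/DRAM): the chain starts from the mean of the final DE
    generation, and the proposal covariance is computed from the
    pooled samples of the last few generations."""
    pooled = np.concatenate(de_history[-n_last:], axis=0)
    start = de_history[-1].mean(axis=0)
    cov = np.cov(pooled, rowvar=False)
    return start, cov
```

Here `de_history` is a list of population arrays, one per DE generation; the returned pair seeds the adaptive sampler, which then refines the covariance on the fly.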
One may anticipate that the final DE populations could provide a good initial proposal for MCMC. As shown in the Results section, even more can be true: the DE samples give a good approximation of the parameter posterior distribution itself. Naturally, this is case dependent and should not be expected in all situations.

2.5. Extended feature vectors: Dynamics and different time scales.
The key idea in the above form of the estimation approach is to consider the computed or measured state vector values as a point cloud sample from the underlying fixed attractor, and to construct a likelihood based on the CIL feature vector. While the mapping from the point cloud to the CIL feature vector has preferable properties such as the Gaussianity, it also inevitably loses information. Indeed, the order of the points is irrelevant, as only the Ecdf of all the pairwise distances is used. Moreover, as only the phase space values are used, the dynamics of the system is lost. Here we introduce ways that enable the estimation of the dynamics of the systems. Note that in many applications dynamical systems may also contain vastly different time scales. So, we discuss both the 'overall' time evolution of a system, and ways to deal with systems consisting of fast or slow subsystems.
In order to enhance the information extracted from the measured point cloud, we can introduce several constructions in addition to the phase space values, such as the concept of embedding dimension. Here we focus on how to include the time behaviour of the system. We extend the state vector by the time derivative, i.e., consider the extended state vector s̄ = (s, ṡ). During a numerical integration the values of ṡ are easily obtained from the right hand side of the differential equation system. The respective experimental values from noisy measurements must be calculated judiciously, as numerical differentiation of noisy data is a well-known ill-posed problem. In our test cases we use a basic statistical rule for selecting the time difference ∆t between the noisy observation points: the difference of consecutive observed state values should be 2 to 3 times larger than the noise level of the data. Our examples demonstrate that the accuracy of parameter estimation is clearly improved, indeed dramatically in many cases, by introducing such derivative information. Note that we do not need overall dense measurement points; sparse sets of pairs (t_i, t_i + ∆t) are enough, and they are used in the example cases below.
The state values and the respective derivatives may have quite different scales, so we create different feature vectors for each of them. The final stochastic feature vector is obtained by concatenating the two vectors. The Gaussianity of the combined vector can again be confirmed in the same way as earlier. We give several examples detailing these steps below in the Results section.
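The derivative-extended feature construction can be sketched as follows (illustrative Python; the finite-difference step follows the 2 to 3 times noise-level guideline given above, and the function names are ours):

```python
import numpy as np

def velocity_estimates(s_t, s_t_dt, dt):
    """Finite-difference estimate of the state derivative from
    measurement pairs taken at t_i and t_i + dt. The step dt should be
    chosen so that |s(t + dt) - s(t)| is 2-3 times the noise level."""
    return (s_t_dt - s_t) / dt

def combined_feature(y_state, y_vel):
    """Concatenate the state and velocity Ecdf feature vectors into one
    2M-dimensional feature vector. The two parts use separately chosen
    bins, since states and velocities can have very different scales."""
    return np.concatenate([y_state, y_vel])
```

The Gaussianity of the concatenated vector is then verified exactly as for the plain state-value feature vector.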
Next, let us discuss reliable parameter estimation for chaotic systems with dynamics taken into account in more complex situations, where different parts of the system may have vastly different time scales. In particular, assume that the fast dynamics corresponds to a stable transition of the solution trajectories starting "outside" the "slow" chaotic manifold and reaching a quasi-steady state with the slow manifold.
For the original system of equations containing both fast and slow variables we may want to estimate the coefficients of both the slow and the fast parts of the dynamical system within one procedure, combining the classical deterministic and the chaotic system parameter identification approaches. However, the parameters of the fast transition cannot be identified as part of the CIL likelihood; they should be estimated from separate measurements taken before the solution trajectory reaches some vicinity of the slow manifold, after which it continues in a quasi-steady state with the manifold. To do that we need to make sure that the point from which we start the 'fast' parameter fit (as well as the data which we may want to use in the parameter identification process) indeed lies away from the chaotic attractor. On the other hand, in the CIL approach we should check that, after randomization of the initial values of the state variables, the points taken from the trajectory are indeed located on the slow chaotic manifold.
Both situations call for methods that enable one to estimate whether a given point is located close to or away from a 'slow' attractor. Below, we discuss the procedure, initially introduced in [24], that allows one to check whether a particular point belonging to a solution trajectory has a slow manifold (chaotic or deterministic) in its vicinity.
Although the procedure works in a more general form, where the original system contains several fast variables in addition to the slow ones, here, for the sake of clarity, we discuss the special case of a system with a single fast variable which adjusts to the behavior of one chaotic variable, or a combination of chaotic variables. In particular, we note that every system presented in Appendix 3 can be written in the form

(3) Ẋ = f(X), X = (x₁, x₂, x₃)*,

where the vector function f on the right-hand side of (3) is just the right-hand side of the equations shown in Appendix 3; in (3), "*" stands for transpose. In what follows we only consider chaotic systems with differentiable right-hand sides; since the right-hand sides of the majority of the models presented in Appendix 3 are polynomials, this condition is obviously satisfied for them. Consider the extended system

(4) Ẋ = f(X), Ẇ = −K (W − x₁),

where W is a "fast" scalar variable and K > 0 is a large constant associated with the characteristic time of adjustment of W(t) to X(t). Before we may apply the results of [24] we need to re-write (4) in the special form considered in [24]. First, let us change the variable in (4):

(5) V(t) = W(t) − x₁(t),

and thus V̇ = −K V − ẋ₁ = −K V − f₁(X). Now we can introduce the notation for a new vector function,

(6) y = (x₁, x₂, x₃, V)*, g(y) = (f(X), −K V − f₁(X))*,

and write the system for y(t) as ẏ = g(y). In what follows we use the upper index "0" to denote the values of various expressions evaluated at (y(t₀), t₀). We re-write the system ẏ = g(y) in the equivalent form

(7) ẏ = g⁰ + J⁰ · (y − y⁰) + g̃(y, y⁰),

where J⁰ is the Jacobian matrix of (6), i.e., J = g_y, evaluated at (y(t₀), t₀), and g̃(y, y⁰) = g(y) − g⁰ − J⁰ · (y − y⁰) = O(|y − y⁰|²).
The notation "|·|" is used to denote the Euclidean norm for vectors; for matrices it denotes the norm induced by the Euclidean vector norm, i.e., for a matrix A we have |A| = (ρ(A* · A))^{1/2}, where ρ is the spectral radius. The (4 × 4) Jacobian matrix J⁰ has four eigenvalues, of which three are the eigenvalues of the (3 × 3) matrix f⁰_X; for illustrative purposes, and without loss of generality, we assume that these are all real and distinct, λ₁ ≠ λ₂ ≠ λ₃; a similar procedure works in the more general case as well. The fourth eigenvalue is λ₄ = −K, with K ≫ 1. Let us introduce the change of variables y = y⁰ + T · u, where T is the matrix transforming J⁰ into diagonal form. Applying this change of variables to system (7), we obtain (see the details in Appendix 2; the notations Λ⁰ and Λ⁰₁ used below denote the matrices of eigenvalues, defined in formula (S1) of Appendix 2)

u̇ = T⁻¹ g⁰ + Λ⁰ · u + T⁻¹ g̃(y⁰ + T u, y⁰),

which, denoting by û = (u₁, u₂, u₃)* the slow components and by (ĝ⁰, ĥ) and (ĝ⁰₄, ĥ₄) the corresponding splittings of T⁻¹ g⁰ and T⁻¹ g̃, is equivalent to

(8) û̇ = Λ⁰₁ · û + ĝ⁰ + ĥ(û, y⁰), u̇₄ = −K · u₄ + ĝ⁰₄ + ĥ₄(û, y⁰).

We note that from the special structure of the system (4) it follows that the vector function (ĥ, ĥ₄)* does not depend on u₄. After re-writing the second equation of (8) as

(9) u₄ = (1/K) (−u̇₄ + ĝ⁰₄ + ĥ₄(û, y⁰)),

we can easily check that (8) and (9) are exactly in the form for which the results in [24] were derived.
According to the simplified algorithm presented in [24], to verify whether there exists a "slow" attracting locally invariant manifold in the vicinity of the point (y(t₀), t₀), defined as a ball of radius R̃ in the 4-dimensional phase space, we need to check the inequality (10) derived in [24]. If inequality (10) is satisfied, then the point (y(t₀), t₀) lies close to or on the slow manifold and the phase space of the system may be reduced from 4-dimensional to 3-dimensional (for K ≫ 1, which is equivalent to 1/K ≪ 1, the slow manifold in the leading order approximation corresponds to u₄ = 0 + O(1/K)); if, on the other hand, (10) is not satisfied, then the point (y(t₀), t₀) lies outside the R̃-vicinity of the manifold, and the reduction is not possible (which allows one to identify the value of the parameter K if the data are collected at appropriately densely chosen time points). An example illustrating this procedure is included in the Results section.

3. Results.
3.1. Example: Lorenz 63. We first demonstrate the steps described above and discuss the results for the classical Lorenz 63 system:

(11) Ẋ = β (Y − X), Ẏ = X (γ − Z) − Y, Ż = X Y − α Z.

As the data signal, we select the values of X and Y at 8000 time points uniformly distributed over the time interval [0, 80000], perturbed with Gaussian noise of a size corresponding to 5% of the signal. On purpose, we let the gap between the measurements be larger than the predictable time interval of the system. Figure 1 exhibits the initial stage of the analysis: an ensemble of 64 simulations with slightly varying initial values is shown. While the predictable time interval is roughly of length 7, the measurements are taken at intervals of length 10. Examples of possible measurements are denoted in the picture (naturally, for the data signal the values of only one trajectory are selected). So, the measurement values are clearly impossible to predict over the assimilation window, which rules out the use of any of the standard filtering methods discussed in the Introduction. The likelihood function is constructed from the measured signal by a subsampling approach. We randomly select two disjoint sets of 2000 measurement vectors (X_i, Y_i) from the signal, compute the 2000 × 2000 matrix of Euclidean distances between the vectors of the two sets, and create the empirical cumulative distribution function of the distances in the log-log scale to get one of our feature vectors. This is repeated around 2000 times.
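The subsampling construction of one training feature vector can be sketched as follows (illustrative Python; the function name and argument defaults are ours):

```python
import numpy as np

def ecdf_feature(data, radii, n_sub=2000, rng=None):
    """One training feature vector: draw two disjoint random subsets of
    n_sub measured state vectors, compute the n_sub x n_sub matrix of
    Euclidean distances between the subsets, and evaluate the Ecdf of
    the distances at the given bin radii. Repeating this ~2000 times
    yields the training set for the CIL."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(data))
    A = data[idx[:n_sub]]                 # first disjoint subset
    B = data[idx[n_sub:2 * n_sub]]        # second disjoint subset
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.array([np.mean(d < R) for R in radii])
```

Note that the full n_sub × n_sub distance matrix is held in memory, which is unproblematic for n_sub = 2000 but may require chunking for much longer subsamples.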
In the present case, we select the number of radii to be M = 10 and get R_0 = 2.51, b = 2.04. Finally, we can check the normality of the ensuing distribution. Figure 2b shows that the histogram follows the χ² distribution, and Figure 2a exhibits the variability of the feature vectors of the training set. Now that the cost function is constructed, the DE optimizer is applied. Figure 2c shows the convergence of the population toward the correct parameter values.

3.2. Lorenz 63 with dynamics included. Let us assume that all the settings for the data and the computational setup of the previous example hold. In addition, we also consider the Euclidean distances between the velocity vectors (Ẋ, Ẏ). In the training set this requires that additional measurements are available: the observation time points t_i = 10 · i, i = 1, ..., 8000, are replaced with pairs (t_i, t_i + ∆t). The noise level is kept at 5%. The step ∆t must be large enough to allow an estimate of the derivative from the noisy data. Other than that, the approach is not too sensitive to the choice of the time step. In the present example a simple Runge-Kutta ODE solver is employed, with time step 0.01. We choose ∆t = 0.1, but note that any value in the range 0.1 < ∆t < 0.8 produces quite similar results. We map the sets of derivative distances into a log-log curve, measured at M = 10 different decreasing radii with base b_vel = 2.0; see Figure 3a for plots of the feature vectors for both the state and the derivative values. To create a likelihood we concatenate the respective vectors into a 2M dimensional feature vector. It is then straightforward to check again the Gaussianity of the resulting stochastic vector; see Figure 3b for the numerical verification by the χ² distribution test.
The parameter estimation proceeds exactly as before, only with the likelihood function now using the extended feature vector. Figure 3c illustrates a noticeable improvement in the identification of the parameters after the derivative update to the likelihood function. Other examples (see below) show even more significant improvements.
The addition of the derivative values also allows us to estimate the overall speed of the time evolution of the system, something not available in the basic CIL version. We demonstrate this by multiplying the right hand side of (11) by a constant K = 10 and estimating the four parameters α, β, γ, K. In this case, especially, the final DE population values are strikingly close to the MCMC posterior values. Note, however, that the DE posterior approximation typically has a larger uncertainty, as it consists of a relatively small sample of parameters, with no guarantee of ergodicity. Finally, let us consider the Lorenz 63 system (11) as an example illustrating the methodology of different time scales; the system is of the general form (3).
The extended system, corresponding to (4), is

(12) Ẋ = β (Y − X), Ẏ = X (γ − Z) − Y, Ż = X Y − α Z, Ẇ = −K (W − X),

where K > 0 is a large constant. Changing the variable (compare with (5)) V(t) = W(t) − X(t), we rewrite the system (12) in the form (7), where (X⁰, Y⁰, Z⁰, V⁰) is some chosen point of interest lying on the solution trajectory. Now, let us consider the choice of parameters α = 8/3, β = 10, γ = 28 and the case where the variable V is fast (K = 1000). We take initial conditions such that the trajectory starts on the attractor for the variables X, Y, Z and off the attractor for the variable V. Then we take a few points on the trajectory corresponding to different values of t and check the condition (10) for them. The results are presented in Figure 4. With respect to the chosen value of R̃ = 2, the points corresponding to t = 1.0025 and t = 1.0035, for which the condition (10) is satisfied, are considered to belong to the R̃-vicinity of the chaotic attractor.

Further numerical experiments.
To further test the robustness of the approach, simulations similar to those presented above for the Lorenz system have been performed for a large number of other chaotic systems. The respective equations, together with the CIL parameters used to create the likelihood functions, are listed in Appendix 3. In all cases our approach leads to consistent results. We used the choice M = 10 for the number of bins in all cases, and obtained the values for R_0, R_M and b with the same recipe as explained above for Lorenz63. We emphasize that the choice of these constants does not typically require any elaborate case-dependent hand tuning. However, one has to pay attention to selecting a long enough time interval for the data collection, to ensure that all the 'corners' of the attractor are represented in the training of the likelihood. Note that a natural requirement for successful simulations is that the system remains inside the same domain of attraction, both with respect to the initial values of the state vector and the model parameter values.
The number of parameters estimated for the cases listed in Appendix 3 varies from 3 to 17. As an example, here we briefly present the results for the so-called Wang and Chua attractors.
The equations and the numerical setups for the experiments can be found in Appendix 3. Here we demonstrate again the impact of using the additional derivative information, as compared to the state values only.
For the Wang attractor [34] (13), examples of the velocity vector fields are shown in Figure 5a.
The 2D marginal parameter posteriors, obtained with noiseless and noisy (again with 5% Gaussian noise added) data, respectively, are shown in Figure 5b. We see that the inclusion of the information on the dynamics dramatically shrinks the size of the parameter posterior distributions, and an inspection against the true underlying parameter values (α, β, γ, δ, ε, ζ) shows that the estimates concentrate around the correct values.
For the Chua attractor, the trajectory is shown in Figure 6a; the equations of the system are given in Appendix 3. In the parameter estimation both the state values and the velocities are used. The 2D marginal posteriors between the 17 parameters are shown in Figure 6b, obtained with noiseless and noisy (with 1% relative Gaussian noise added) data, respectively. It is remarkable that all 17 parameters can be identified by the CIL likelihood construction with high accuracy. The noiseless posteriors have at most around 2% relative error, while the noisy measurements yield posteriors with at most 6% relative error. Such high accuracy is partly due to the special character of the system: even moderately small steps outside the sampled parameter distributions may lead to qualitatively different attractors, with, e.g., some of the seven 'discs' missing. In the noiseless case the parameter posteriors show only mild correlations, while adding noise clearly increases the correlations between some of the parameters.

4. Discussion and conclusions. Estimation of the parameters of chaotic dynamical systems is complicated by the fact that any changes in the numerical solution, including solver settings for an otherwise fixed system, lead to different trajectories beyond an initial predictable time interval. In this sense no standard solution exists for the task of parameter estimation from long-time trajectory data. A partial solution is provided by filtering methods, which essentially split the total time interval into predictable subintervals and follow the trajectory of data by various prediction/correction methods.
But these methods fail if the time intervals between data points are too long for prediction. We present a new approach that is based on characterizing the set of trajectories corresponding to the given data, i.e., the underlying attractor, rather than trying to follow any specific trajectory close to the data. For this purpose we employ the distance concept introduced in [12], based on computing the statistics of the empirical cumulative distribution function (ecdf) of a point cloud of observations. A few 'tuning' parameters are needed for the likelihood construction, but they are usually found by a straightforward recipe.
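The construction can be sketched as follows: repeated subsamples of the training data give a sample of ecdf curves, whose mean and covariance define a Gaussian likelihood for the curve of any candidate point cloud. The Gaussian stand-in clouds, bin choices and jitter term below are our own illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
radii = np.geomspace(0.1, 6.0, 11)

def ci_curve(cloud):
    """Correlation-integral values (ecdf of pairwise distances) at fixed radii."""
    d = pdist(cloud)
    return np.array([(d <= R).mean() for R in radii])

# Training: repeated subsamples give the mean and covariance of the curve.
curves = np.array([ci_curve(rng.standard_normal((200, 3))) for _ in range(50)])
mu = curves.mean(axis=0)
cov = np.cov(curves.T) + 1e-10 * np.eye(len(radii))   # jitter for stability

def log_likelihood(cloud):
    """Gaussian log-likelihood (up to a constant) of a candidate cloud."""
    r = ci_curve(cloud) - mu
    return -0.5 * r @ np.linalg.solve(cov, r)

ll_match = log_likelihood(rng.standard_normal((200, 3)))   # same statistics
ll_off = log_likelihood(3.0 * rng.standard_normal((200, 3)))  # different scale
print(ll_match > ll_off)
```

A cloud with the same variability as the training data scores a much higher likelihood than one with different statistics, which is exactly the property the parameter estimation exploits.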
With a large number of examples of chaotic systems we demonstrate how this approach allows one to perform the parameter estimation and subsequent MCMC sampling of the parameter posteriors. The initial parameter estimation is performed by the Differential Evolution method. It appears quite robust with respect to a poor choice of the initial parameter values. Moreover, it often yields a remarkably good initial proposal for an adaptive MCMC sampling. No knowledge of the initial values of the state vector is needed, as long as they stay in the domain of attraction of the data. Indeed, the initial states are randomized and an initial time interval of the simulations is discarded to guarantee that the point cloud used for the likelihood construction samples the underlying attractor. For more complicated models, with higher dimensions and consisting of coupled subsystems with different time scales, more careful analysis is needed. We take a step in this direction by discussing an approach that allows one to check whether a trajectory point has a 'slow' attractor in its vicinity.
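A toy version of the Differential Evolution step is sketched below: a scalar scale parameter of a point cloud is recovered by matching its correlation-integral curve to a target curve, with wide bounds and no informed initial guess. The toy system and cost function are our own stand-ins for the chaotic models and the CIL likelihood.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.spatial.distance import pdist

radii = np.geomspace(0.1, 8.0, 11)

def ci_curve(cloud):
    d = pdist(cloud)
    return np.array([(d <= R).mean() for R in radii])

# "Data": a cloud generated with the true (unknown) scale parameter 2.0.
true_scale = 2.0
target = ci_curve(true_scale * np.random.default_rng(2).standard_normal((200, 2)))

def cost(theta):
    # Independently generated cloud: only the attractor statistics matter,
    # not any particular trajectory or initial state.
    cloud = theta[0] * np.random.default_rng(3).standard_normal((200, 2))
    return np.sum((ci_curve(cloud) - target) ** 2)

# Wide bounds, no initial guess needed.
res = differential_evolution(cost, bounds=[(0.1, 10.0)], seed=4)
print(res.x[0])
```

The recovered scale lands close to the true value 2.0 despite the deliberately uninformative bounds, mirroring the robustness to poor initial parameter values observed in the experiments.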
The presented approach may be extended in various ways. Here we develop the inclusion of the time derivative of the state into the likelihood function. This, even if approximated from noisy observations, greatly increases the accuracy of the estimation results. Moreover, it allows the estimation of an overall time evolution rate of the system. An obvious next step is to consider high dimensional chaotic partial differential equation (PDE) systems. Challenges with increasing CPU times, both in simulating the dynamical systems and in calculating the distance matrices, can be treated in various ways: using parallel simulations and effective MCMC schemes, such as early rejection, parallel sampling [32] and Local Approximation MCMC [5].
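Approximating the time derivative from noisy observations can be done, for instance, with a smoothing differentiation filter; plain finite differences amplify the noise by roughly one over the sampling step. The Savitzky-Golay filter and the specific window settings below are our own illustrative choices, not necessarily those used in the experiments.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(5)
t = np.linspace(0.0, 2.0 * np.pi, 401)
x_noisy = np.sin(t) + 0.01 * rng.standard_normal(t.size)

# Smoothed derivative estimate via a Savitzky-Golay filter: a local cubic
# fit over a sliding window, differentiated analytically.
dx = savgol_filter(x_noisy, window_length=31, polyorder=3,
                   deriv=1, delta=t[1] - t[0])
err = np.max(np.abs(dx - np.cos(t)))
print(err)
```

The smoothed estimate stays close to the true derivative cos(t) despite the observation noise, so derivative-based features remain usable with noisy data.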
The presented approach can also, with natural modifications, be applied to problems other than chaotic systems, such as stochastic differential equations or deterministic PDE systems that create random patterns from random initial values. The basic idea there is the same as in the examples treated here: instead of trying to predict an unpredictable solution, we characterize the variability of the outcomes, and create a Gaussian likelihood that enables a robust statistical analysis of the situation.

Appendix 2. For the readers interested in practical implementations of the described algorithms, we present the derivation of the formula for the matrix T. We look for a T which satisfies the following: Practically, we construct T in the form (16), such that a_1, a_2, a_3 are the eigenvectors of f_x^0. The columns of T defined in (16) are the eigenvectors of J_0. In particular, (0, 0, 0, 1)* is the eigenvector of J_0 corresponding to the eigenvalue λ_4 = −K. The constants b_i, i = 1, 2, 3, may be determined from the condition that the vectors (a_i1, a_i2, a_i3, b_i)*, i = 1, 2, 3, must be the eigenvectors of J_0 corresponding to the eigenvalues λ_i, i = 1, 2, 3. Satisfying this condition produces the equations (19) for b_i, i = 1, 2, 3. For each i = 1, 2, 3 in (19), the first three equations are satisfied automatically, and the fourth equation determines b_i.

Appendix 3. Here we present all the attractors for which we performed the study, as well as the corresponding choices of the parameter values. Below, f_t denotes the final time of integration, N_o is the number of measurements, and N is the number of subsamples used to create the training sets.