Data-driven Evolutions of Critical Points

In this paper we are concerned with the learnability of energies from data obtained by observing time evolutions of their critical points starting at random initial equilibria. As a byproduct of our theoretical framework we introduce the novel concept of mean-field limit of critical point evolutions and of their energy balance as a new form of transport. We formulate the energy learning as a variational problem, minimizing the discrepancy of energy competitors from fulfilling the equilibrium condition along any trajectory of critical points originated at random initial equilibria. By Gamma-convergence arguments we prove the convergence of minimal solutions obtained from finite number of observations to the exact energy in a suitable sense. The abstract framework is actually fully constructive and numerically implementable. Hence, the approximation of the energy from a finite number of observations of past evolutions allows to simulate further evolutions, which are fully data-driven. As we aim at a precise quantitative analysis, and to provide concrete examples of tractable solutions, we present analytic and numerical results on the reconstruction of an elastic energy for a one-dimensional model of thin nonlinear-elastic rod.


Evolutions of critical points
Many time-dependent phenomena in physics, biology, social, and economical sciences as well as iterative algorithms in machine learning can be modelled by a function x : [0, T ] → H , where H represents the space of states of the physical, biological, social system, or digital data, which evolves from an initial configuration x(0) = x 0 towards a more convenient state or a new equilibrium. The space H can be a conveniently chosen Hilbert space. This often implicitly assumes that x evolves driven by a minimization process of a potential energy E : [0, T ] × H → R. In this preliminary introduction we consciously avoid specific assumptions on E , as we wish to keep a rather general view. Inspired by physics for which conservative forces are the derivatives of the potential energies, one can often describe the evolution as satisfying a gradient flow equation of the typė where ∇ x E(t, x) is some notion of differential of E (in the simplest case ∇ x may represent the Frechét derivative of the energy E ; in other cases it might already take into consideration additional constraints which are binding the states to a certain sets, i.e., x(t) ∈ K(t) ⊂ H ). Physical systems naturally tend to minimize the potential energy. For this fundamental reason the study of steady states in physical systems or critical points of the energy is of utmost relevance, given the expected frequency for such states to occur. However, once a critical point x * is reached, i.e., ∇ x E(t * , x * ) = 0, the dynamics is not supposed to further progress, unless some of the constraining conditions are changing, leading to a modified energetic profile. In this case, the evolution would restart and tend again by gradient flow to another critical point satisfying the new constraints. In view of the relevance of critical points, it is often of interest to exclusively observe their dynamics, rather than record the transitions between them. If we imagine now to collapse to an instant the time of realization of the -microscopically in time -gradient descent evolution, we could interpret the dynamics -macroscopically in time -as the instantaneous hopping from a critical point to another critical point. This time reparametrization can be rather conveniently realized as limit for ε → 0 of a singularly perturbed version of (1.1) for a rescaling parameter ε > 0 and a choice of x 0 fulfilling the criticality condition ∇ x E(0, x 0 ) = 0. In view of the vanishing parameter ε , the trajectories would tend in the limit to have unbounded velocity (in the rescaled time) and therefore classical compactness arguments, such as Ascoli-Arzelá Theorem, would fail to characterize limit trajectories for ε → 0 . Luckily recent works [1,27,28,30] explored ad hoc compactness methods along solution trajectories x ε under suitable smoothness assumptions on the energy E and certain generic conditions, so-called transversality conditions [20,2,30], on the sets of critical points C(t) = {∇ x E(t, x) = 0} (compare assumptions (E1)-(E4) below). The more restrictive assumption of all is perhaps the request for the state space H of being of finite dimension, i.e., H = R d .
We are informed of work in progress [3], which will relax this latter request to arbitrary Hilbert spaces, but in this present paper we will restrict ourselves to the available compactness results in [1]. Therefore, we will assume throughout this paper that indeed H = R d , which is enough for most numerical applications. In fact, any problem with infinite dimensional state space would eventually need to be discretized and reduced to finite dimensions in order to be numerically computable. The main compactness result in [1] may be summarized as follows Theorem 1.1 (Agostiniani and Rossi, 2017). Let ε n → 0 , let x n 0 → x 0 in R d , and let x εn be the solution of (1.2) associated to ε n and to the initial condition (x n 0 ) . Then, for all 1 ≤ p < ∞ , there exists a trajectory x ∈ L p ((0, T ), R d ) and a positive Radon measure ν ∈ M + b (0, T ) such that the following properties hold: (a) up to a subsequence, x εn → x in L p ((0, T ), R d ) for every p ∈ [1, +∞) , and pointwise for all t ∈ [0, T ] ; (b) for every t ∈ [0, T ] the pointwise limit function x(·) constructed in (a) admits left and right limits x − (t) and x + (t) , respectively, and x ± (t) ∈ C(t) ; (c) the set J := {t ∈ [0, T ] : ν({t}) > 0} is at most countable and coincides with the set of discontinuity points of x(·) ; (d) for every s, t ∈ [0, T ] it holds

Learning the energy from observation of the dynamics
The evolution of critical points t → x(t) obtained by Theorem 1.1 fulfills in particular the energy conservation principle (1.3) and it is fully driven and explained by the energy E itself. In some relevant cases the energy that governs a system can be derived theoretically or accurately measured experimentally, as it happens in the first principle physics; in most of the cases, the energy needs to be approximated by solving the inverse problem of fitting the data. Model selection and parameter estimation methods are employed to determine the form of the governing energy. Data-driven estimations are needed, for instance, in training algorithms in machine learning [9,14,15,17], and in data assimilation for models in continuum mechanics [21,10] computational sociology [7,23,22] or economics [4,11,16]. However, even the problem of determining whether time shots of a linear dynamical systems do fulfill physically meaningful models, in particular have Markowian dynamics, is computationally intractable [12]. For nonlinear models, the intractability of learning the system corresponds to the complexity of determining the set of appropriate candidate functions to fit the data. In order to break the curse of dimensionality of learning dynamical systems, one requires prior knowledge on the system and the potential structure of the governing equations. For instance, in the sequence of recent papers [29,25,26] the authors assume that the governing equations are of first order and can be written as sparse polynomials, i.e., linear combinations of few monomial terms.
In this work we aim at bridging, in the specific setting of deterministic evolutions of critical points, the well-developed theory of mean-field equations with modern approaches of approximation theory and machine learning. We provide a mathematical framework for the reliable identification of the governing energy from data obtained by direct observations of corresponding time-dependent evolutions. We would like to obtain results which ensure the learning of energies without the need of more restrictive assumptions than (E1)-(E4). Moreover, the approximation of the energy from a finite number of observations of past evolutions allows to simulate further evolutions, which are then fully data-driven.
First of all, we need to formalize what we mean by observations of time evolutions. In this paper we will assume that we are allowed to observe multiple realizations of evolutions of critical points for the same energy function E , starting from different critical points. For that, we need to further modify the model by a suitable correction ( 1.4) so that, whatever is the choice of x 0 , the system starts from an equilibrium. This simply means assuming that an additional force is added at the beginning to allow the state x 0 to be an equilibrium. Then we allow ourselves to draw at random independently several instances of the initial conditions x 1 0 , . . . , x N 0 , . . . according to a fixed probability distribution µ 0 ∈ P c (R d ) with compact support. For each of the picked initial conditions we can finally observe corresponding evolutions of critical points t → x i (t), for i = 1, . . . , N, . . . . We need then to devise a constructive method to infer the energy E from the observed trajectories. Our approach goes through five fundamental theoretical results: 1. Compactness of controlled evolutions of critical points. In the equation (1.4) a correction force has been added to ensure an arbitrary initial datum x 0 to be an equilibrium. As two trajectories x(t),x(t) originating from distinct initial equilibria x 0 ,x 0 , respectively, may intersect at any t ∈ [0, T ] and to promote a unique flow along characteristics (see Remark 2.5 and Remark 3.2), we modify further the model to take the form of an augmented controlled system: The particular the choice of f ≡ 0 and u 0 = ∇ x E(0, x 0 ) yields back (1.4). Our first result, Theorem 2.8, is the generalization of the compactness Theorem 1.1 to the controlled system (1.5). Although the system is not anymore in the form of gradient flow and Theorem 1.1 cannot be directly applied, we show that the techniques in [1] can be adapted to (1.5) without introducing significant technical issues. 2. Mean-field limit of evolutions of critical points. While the initial condition x 0 is distributed as µ 0 , we need then to clarify how the trajectories of critical points x(t) are distributed at any time t ∈ [0, T ]. Informally, we should explain how the initial probability distribution µ 0 gets transported along trajectories of critical points to the probability distribution µ(t) at any time t ∈ [0, T ], so that x(t) ∼ µ(t). We approach this issue under the modeling assumption that the evolutions of critical points are the result of singularly perturbed limit of systems of the type (1.5). In fact, for ε > 0 established results in gradient flow theory [6] allows to describe the evolution of any system of the type (1.5) (assume here for simplicity f ≡ 0 and u 0 = ∇ x E(0, x 0 )) by considering solutions η ε ∈ AC([0, T ], P c (R d × R d )) of mean-field equations of the type (see (3.5) In Section 3.2 we take advantage of the newly established compactness argument Theorem 2.8 and the superposition principle introduced in [6, Theorem 8.2.1] to derive Theorem 3.6 and Proposition 3.7 to describe the probability valued trajectory t → η(t) (below we use equivalently also the notation η t = η(t) ) representing the time dependent distribution of evolutions of critical points as a suitable form of limit of t → η ε (t) for ε → 0. The main characterization of the limit is given by which shows that the first marginal of η(t) is supported on critical-type points. 3. Mean-field limit of the energy balance. We further show that the evolution t → η t is also fulfilling in a suitable sense a generalization of the energy balance (1.3), Theorem 3.10, which explains how the energy E(t, x(t)) is actually distributed at the time t ∈ [0, T ] for the initial condition x 0 ∼ µ 0 . The result is obtained by a simple, but also thoughtful, reformulation of the energy balance using a Lebesgue charaterization of left and right limits and the use again of the compactness argument Theorem 2.8. To our knowledge, Theorem 3.6, Proposition 3.7, and Theorem 3.10 are the first form of mean-field limit of evolutions of critical points available in the literature. 4. A variational model for the energy learning. Inspired by the characterization (1.6), we formulate the problem of learning the true energy E responsible of driving the dynamics from observations of evolutions of critical points as the minimization of the functional on a suitable compact class in W 2,∞ of competitor energiesÊ . To make our approach fully constructive we actually assume to observe only a finite number N of evolutions of critical points and use a finite dimensional set V N W 2,∞ of competitorsÊ . By a Γ -convergence argument for N → ∞, we derive in Theorem 4.11 an approximation result of the true energy E , which was driving the observed dynamics. 5. Data driven evolutions of critical points. Once the energy is learned, it is then possible with the estimated energyÊ to simulate further evolutions. Corollary 4.14 guarantees that the simulated evolutions, which are fully data-driven, approximate "true" evolutions that would have been generated by using the original energy E .
The abstract framework described by the steps 1.-5. is actually fully constructive and numerically implementable. As we aim at a precise quantitative analysis, and to provide an example of tractable solutions, we approach the learning of the governing energy of the evolution for specific models inspired by continuum mechanics. In particular, in Section 5 we present analytic and numerical results on the reconstruction of the nonlinear elastic energy for a one-dimensional model of thin elastic rod.
Let us stress that this particular example is by no means the unique possible application of our general framework and we envisage many other possible applications for data-driven models in physics, biology, social, and economical sciences as well as training algorithms in machine learning.
Let (X, d) be a separable metric space. We denote with M b (X) the set of bounded Radon measures on X and with M + b (X) the subset of positive bounded Radon measures. The symbol P(X) stands for the set of probability measures on X , P c (X) indicates the set of probability measures with compact support in X , and P 1 (X) denotes the set of probability measures with bounded first moment, i.e., measures µ ∈ P(X) such that Let (Y, d ) be another separable metric space, r : X → Y a Borel map, and µ ∈ M b (X) . We define the push-forward r # µ ∈ M b (Y ) of µ through r by the relation r # µ(B) := µ(r −1 (B)) for every B Borel subset of Y . For every µ, ν ∈ P 1 (X) , the 1-Wasserstein distance W 1 (µ, ν) is defined by (see, e.g., [6, Section 7.1]) where Γ(µ, ν):= {γ ∈ P(X × X) : (π 1 ) # γ = µ and (π 2 ) # γ = ν} and π i : X × X → X , π i (x 1 , x 2 ) = x i for i = 1, 2 . We also recall that if µ, ν ∈ P c (X) it holds Finally, given an interval I ⊆ R , we denote with BV (I) the space of functions of bounded variations in I , that is, the space of L 1 loc functions v : I → R whose distributional derivative Dv belongs to M b (I) . We refer, for instance, to [5,Section 3.2] for further details.

Main assumptions and motivations
We start by fixing the main assumptions, which we will hold valid for the rest of the paper. Let T > 0 and let us consider an energy functional E : [0, T ] × R d → R satisfying the following conditions: contains only isolated points.
Remark 2.1. Let us briefly comment on the above assumptions. Hypotheses (E2) is typical in the framework of evolutions of critical points or rate-independent systems, and it is useful to prove the boundedness of trajectories exploiting Gronwall's type arguments. Condition (E3) implies the more common compactness of sublevels of the driving energy E . The need of an explicit bound from below will be clarified below (see in particular (2.1), Lemma 2.4, and Proposition 2.6). Finally, assumption (E4) has been considered, e.g., in [1,27,28,30], in the uncontrolled case u = 0 of equation (1.2). This kind of hypothesis has been proven, at the state of the art, to be quite useful to show compactness of trajectories of (1.2) in the limit as ε → 0 . In this paper, we focus on the perturbed system (2.1), where a control u ∈ R d is added. In order to be able again to show compactness of trajectories of (2.1) as ε → 0 , we need the stronger requirement (E4). We refer to Section 2.2 for a discussion on the compactness issue. Roughly speaking, we need that the energy E(t, ·) has no affine regions, for every t ∈ [0, T ] . We remark that (E4) and its corresponding assumption (E 3 ) in [1] are both technical and likely equivalently "artificial". It is indeed clear that (E4) implies (E 3 ) . On the other hand, if E does not satisfy (E4) for some u , then the linear perturbation E(t, x) − u · x does not satisfy (E 3 ).
Here we are interested in studying the system (1.4). As already mentioned in the introduction, the reason for adding the term ∇ x E(0, x 0 ) to the usual gradient flow system (1.2) is twofold. On the one hand, in the limit as ε → 0 we want to avoid jump discontinuities at time t = 0 and ensure that x 0 is an equilibrium from the very beginning. Since the limits of trajectories of (1.4) are expected to satisfy ∇ x E(t, x(t)) = ∇ x E(0, x 0 ) , jumps at t = 0 will not appear. On the other hand, the drift ∇ x E(0, x 0 ) can be exploited to add randomness to (1.2). This can be done simply by assuming that the initial data are distributed according to a certain probability measure µ 0 ∈ P(R d ) . In fact, in what follows, we aim first at obtaining the mean-field description of (1.4) for fixed ε > 0 (Section 3, standard), and then pass to the limit in the mean-field (or continuity) equation as ε → 0.
Remark 2.2. On the one hand, we notice that for every initial datum x 0 ∈ R d there exists unique a solution to (1.4). On the other hand, we could show easily some examples of energies E satisfying (E1)-(E4) and such that, for two different initial data x 0 ,x 0 ∈ R d , the corresponding solutions of (1.4) cross each other at some time t ∈ (0, T ). For this reason, it is more convenient to study (1.4) and its mean-field limit in terms of pairs curve-initial datum (x(·), x 0 ) , in order to ensure uniqueness of transport along characteristics, see Remark 2.5 below.
In view of the above comments, we consider the more general system We further assume Assumption (E5) ensures well-posedness and continuity of solutions from initial data of In the following lemma, we collect the properties of E .
Lemma 2.4. The following facts hold: for every x ∈ R d and every u ∈ K ; Proof. Property (b) follows from the smoothness of E . Statement (a) follows from (E2) and (E3). Indeed, By construction of E , we have that Choosing δ > 0 so small that δ/C 3 < 1/2, we get that Therefore, we deduce (2.2) as soon as u ∈ K .
Proposition 2.6. Let (x 0 , u 0 ) ∈ R d × R d and let ε > 0 . Let (x ε (·), u ε (·)) be the solution of (2.1). Then, the following facts hold: It is useful to observe that this bound implies is differentiable. The energy balance in (a) follows by chain-rule, recalling the time derivatives as in (2.1). We report below the explicit computation as it will be quite useful several times below. First of all we notice that from (2.1) By integration we obtain (a). We address now (b). Being u ε (t) uniformly bounded in terms of initial datum u 0 , applying Lemma 2.4 and arguing as in (2.3) we get that for some C 1 , C 2 > 0 independent of ε . By Gronwall lemma, In view of (E3) and of the boundedness of u ε , we get that x ε (t) is bounded in R d uniformly with respect to ε and t ∈ [0, T ] Hence also (b) is proved. Finally, property (c) is a consequence of (a) and (b) .
Remark 2.7. The boundedness of the trajectories (x ε (·), u ε (·)) in Proposition 2.6 can be made independent of the specific initial datum if we consider initial data (x 0 , u 0 ) in a fixed compact subset K • of R d × R d . This assumption will be tacitly applied from now on.

Compactness of trajectories
In this section we prove the compactness of trajectories (x ε (·), u ε (·)) fulfilling (2.1) as ε → 0 . In particular, we show how to adapt the arguments of [1] in order to take into account also the control u. In fact the second equation in (2.1) does not obey a gradient flow, hence does not allow a direct application of Theorem 1.1. For the sake of simplicity, from now on we set Λ := C([0, T ]; R d ) and, for p ∈ [1, +∞) , In order to describe the energetic behavior of a limit of (x ε (·), u ε (·)) for ε → 0, for every t ∈ [0, T ] , every x 1 , x 2 ∈ R d , and every u ∈ R d we define the cost function c t (x 1 , x 2 ; u) as otherwise, where A t x1,x2,u is the set of admissible transitions from x 1 to x 2 at time t where C(t, u) is as in assumption (E4). Notice that the cost function (2.6) and the class of admissible transitions (2.7) are modified versions of corresponding of cost and admissible transitions in [1, Definition 2.2]. As clearly stated in the following theorem, the cost (2.6) describes the energy dissipated by the limits of (x ε (·), u ε (·)) at jump points.
With the notation introduced above, we can now state the main compactness result of this section. It generalizes Theorem 1.1 (see [1,Theorem 1]) for controlled systems of the type (2.1).
and let (x εn , u εn ) be the solution of (2.1) associated to ε n and to the initial condition (x n 0 , u n 0 ) . Then, there exists a pair (x, u) ∈ Γ ∞ T × Λ and a positive Radon measure ν ∈ M + b (0, T ) such that the following properties hold: (a) up to a subsequence, x εn → x in Γ p T for every p ∈ [1, +∞) , and pointwise for all t ∈ [0, T ], and u εn → u uniformly in Λ ; (c) for every t ∈ [0, T ] the pointwise limit function x(·) constructed in (a) admits left and right limits x − (t) and x + (t), respectively, and x ± (t) ∈ C(t, u(t)); (e) for every s, t ∈ [0, T ] it holds To prove Theorem 2.8 we follow the steps of [1]. For the reader convenience, we show the main changes in the proofs, but we may refer to [1] for concluding details, which do not require modifications. We start with the analysis of useful properties of the cost function c t (see Proposition 2.10 and Remarks 2.11 and 2.13 below). Lemma 2.9. Let K be a compact subset of R d , u ∈ R d , and let t ∈ (0, T ) be such that Then, there exists α > 0 such that Proof. It follows from the continuity of the function We state and prove now a result, which generalizes [1, Proposition 4.1].
Then, the following facts hold: Proof. Let us show (a). First, we notice that, by (2.2) and by an application of the chain rule, θ n (s) is Indeed, initial and final points θ n (t i n ) converges to x i , and thus are bounded. Moreover, by chain rule, we have that for every s In view of Lemma 2.4 and of the assumption (2.9), we deduce that θ n (s) is uniformly bounded in R d . Let us denote by Θ the compact subset of R d containing θ n (s) , s ∈ [t 1 n , t 2 n ]. Let us assume by contradiction that x 1 = x 2 . Thanks to condition (E4), the set C(t, u) ∩ Θ is finite. If this were not the case and (x i ) i∈I would be an infinite family of critical points, then we could extract from it a converging subsequence x k → x ∈ C(t, u) ∩ Θ, in view of the continuity of ∇ x E , and x would not be isolated, violating (E4). Hence, there exists δ > 0 such that Applying Lemma 2.9, we may assume, up to taking a smaller δ > 0 , that For n sufficiently large we have that t i n ∈ [t − δ, t + δ] and u n (s) ∈ B(u, δ) for every s ∈ [t 1 n , t 2 n ]. By definition of K δ and by the previous properties, we have that the set {s ∈ [t 1 n , t 2 n ] : θ n (s) ∈ K δ } = Ø and there exist s 1 , s 2 ∈ {s ∈ [t 1 n , t 2 n ] : θ n (s) ∈ K δ } such that s 1 = s 2 and θ n (s i ) ∈ ∂B(x i , δ) for i = 1, 2 . Therefore, by (2.13) we have This contradicts the hypothesis (2.9). Let us now prove (b). Let δ , K δ , and e δ be as in (2.11)-(2.13). Up to extracting suitable subsequences, we can set We reparametrize the time interval in the following way: we first define the strictly increasing function , where s 1 n := s n (t 1 n ) and s 2 n := s n (t 2 n ). We setθ n (s):= θ n (r n (s)) andũ n (s) := u n (r n (s)). In particular,θ n (s i n ) → x i andũ n (σ) → u uniformly for σ ∈ [s 1 n , s 2 n ] . Moreover, by change of variables, (2.14) Notice now that r n (s) = (1 + |∇ x E(r n (s),θ n (s)) −ũ n (s)||θ n (r n (s))|) −1 .
hence |∇ x E(r n (s),θ n (s)) −ũ n (s)||θ n (s)| = |∇ x E(r n (s),θ n (s)) −ũ n (s)||θ n (r n (s))||r n (s)| = |∇ x E(r n (s),θ n (s)) −ũ n (s)||θ n (r n (s))| 1 + |∇ x E(r n (s),θ n (s)) −ũ n (s)||θ n (r n (s))| ≤ 1 , andθ n has finite speed on A δ . The construction of the limiting function works from now exactly as in the proof of [1, Proposition 4.1] taking into account thatũ n (s) is uniformly close to u. Let us explain informally how the construction work and we refer to the above mentioned reference for more details: the sequenceθ n is equibounded and, in view of the finite speed it is also equicontinuous. Up to a further linear time reparametrization, it admits by Ascoli-Arzelà Theorem a limit θ ∈ C([0, 1]; R d ) such that θ(0) = x 1 , θ(1) = x 2 . Moreover, again in view of the finite speed, this limit cannot travel to an infinite number of critical points of minimal distance 2δ . Hence, it visits at most a finite number of them and θ ∈ A t x1,x2,u .
Remark 2.11. From Proposition 2.10 it follows that We now state an autonomous modification of Proposition 2.10, in which the time parameter t is fixed. The result corresponds to [1,Proposition 4.5].
n ,u . Then, the following facts hold: Proof. The proof can be carried out as in [1,Proposition 4.5] working with the energy E(t, ·, u) with fixed parameters t and u.
Remark 2.13. As a consequence of (b) of Proposition 2.12, whenever c t (x 1 , We show now two results useful to describe the energetic behavior of the limits of sequences (x εn , u εn ) .
Proposition 2.14. Let us set Then, the following holds true: (a) there exists a positive Radon measure ν ∈ M + b (0, T ) such that, up to a subsequence, ν n ν is at most countable and coincides with the jump set of E .
Proof. In view of Proposition 2.6 and estimate (2.5), ν n is bounded in mass, uniformly with respect to n , and up to a subsequence, ν n ν weakly * in M + b (0, T ) . If we consider the function F n (t) given by then F n is a sequence of bounded non-increasing functions. Hence, by Helly Theorem, it admits, up to subsequence, a pointwise limit F ∈ BV (0, T ) . From (E2) and Proposition 2.6 (b) the func- is uniformly bounded with respect to t ∈ [0, T ] and it admits a weak * limit G ∈ L ∞ (0, T ) . Therefore, the function E(t): . The rest of the proof goes precisely as in [1,Proposition 5.2]. In particular, Lemma 2.15. Under the assumptions and notations of Proposition 2.14, let t i n → t , i = 1, 2 , leṫ u n (s) = f (u n (s)) and u n ⇒ u uniformly for s ∈ [t 1 n , t 2 n ] , and let where J is as in Proposition 2.14 (d).
Proof. For every τ > 0 The thesis follows from Proposition 2.10.
We are now ready to prove Theorem 2.8.
Proof of Theorem 2.8. Let us denote In view of (c) of Proposition 2.6, Since I is at most countable, we may fix a suitable subsequence such that x εn (t) → x(t) for every t ∈ I , for some limit x(t) ∈ R d . Clearly, we already have that u εn ⇒ u uniformly in [0, T ] , due to the continuity with respect to initial data of the equationu = f (u) .
For t ∈ [0, T ] \ I we definẽ Let us show thatx(t) is well defined for every t ∈ [0, T ] \ I . It is clear that, at least along a subsequence, the limit in (2.18) exists for every sequence s k in A converging to t . We have to prove that it is unique.
for every k , so that we may find a suitable subsequence ε n k such that x εn k (t i k ) → x i as k → ∞. Applying Lemma 2.15 and recalling that t / ∈ J , we get that We notice that, by construction and by continuity of . It suffices to show it for t ∈ [0, T ] \ I , since, by construction, the convergence is already satisfied in I . Assume that for some t / ∈ I we have x εn (t) →x ∈ R d . Fix a sequence t k ∈ A converging to t . In particular, x(t k ) → x(t) by definition. Again, we may fix a subsequence ε n k such that x εn k (t k ) → x(t) . Applying Lemma 2.15 and recalling that t / ∈ J , we get thatx = x(t) . Hence, . Moreover, this implies that the function G determined in (c) of Proposition 4.4 actually coincides with Let us now show that x admits left and right limits for every t ∈ [0, T ] . Let us focus on the existence of x + (t) . Let t 1 k , t 2 k t , and let us assume, without loss of generality, that t 1 k < t 2 k . Up to a subsequence, we may further assume that . Furthermore, by the convergence of x εn (t i k ) to x(t i k ) and of u εn (t i k ) to u(t i k ) as n → ∞ for every k ∈ N and for i = 1, 2, we have that we can construct a suitable subsequence ε n k such that Hence, rewriting the energy balance (2.4) for every k and passing to the limit as k → ∞ we get that This implies that x 1 = x 2 , and the right limit It is now straightforward to see from (c) of Proposition 2.14 that the energy balance . For the opposite inequality, in view of Proposition 2.12 and Remark 2.13 we assume that ,u(t) the optimal transition between x − (t) and x + (t) . By chain rule we have that This concludes the proof of the theorem.
3 Mean-field limit of evolutions of critical points 3.1 Mean-field limit for ε > 0 In this section we deduce the mean-field limit of the ODE system (2.1). Although this is by now a standard procedure, see, e.g., [6,Chapter 8], we show here the details of the passage to the mean-field limit in order to stress the dependence on the auxiliary control variable u introduced in (2.1). Let Remark 3.1. In agreement with the initial value problem (1.4), we could, for instance, imagine that the measure η 0 ∈ P c (R d ×R d ) takes the form η 0 := (id, ∇ x E(0, ·)) # µ 0 for some µ 0 ∈ P c (R d ) . This represents exactly the case where the initial control u 0 is ∇ x E(0, x 0 ) .
which corresponds to the weak form of (3.2) Remark 3.2. Continuing the discussion of Remark 2.2, we want here to further justify the choice of working in the space P where also the control parameter u has its own distribution, and not in P(R d ) , where only the space variable x would be described by a probability measure. Let us indeed consider the simpler setting of (1.4) with the initial data In order to obtain an integral formula as in (3.1), one could try to define a flow X ε,t (x 0 ) that associates to each x 0 the value at time t of the solution of (1.4) starting at x 0 at time t = 0, and plug its inverse into the last term in (3.3). However, as noticed already in Remark 2.2, this is not possible, since for distinct initial data x 0 ,x 0 it could happen that the two trajectories cross each other at time t . Hence, we can not deduce a continuity equation for the sole distribution of x 's.
We want to pass to the limit in (3.2) as N → ∞ . We notice that, in view of (b) of Proposition 2.6, the support of η N ε,t is bounded in R d × R d uniformly with respect to t ∈ [0, T ] , N ∈ N, and ε > 0. In order to identify a continuous in time limit η ε ∈ C([0, T ]; P(R d × R d )) of η N ε , in the following lemma we estimate the Wasserstein 1-distance between η N ε,ti and η N ε,t2 for t 1 , t 2 ∈ [0, T ] (equicontinuity).
for some positive constant C ε depending on ε > 0 but not on t 1 , t 2 , and N .
Proof. Since η N ε is an empirical measure, we simply have that where we have used the system (2.1), the boundedness of x i ε , u i ε , and the hypotheses (E1)-(E5) on E and on f .
in the sense of distributions.
Remark 3.5. According to [6, Section 8.1], we can pick the measure Therefore, η ε solves (3.5) in the sense of distributions. By [6, Section 8.1], the solution of (3.5) is unique. Thus, the whole sequence η N ε converges to η ε uniformly with respect to W 1 .

Mean-field limit for ε → 0
In order to derive a mean-field limit of evolutions of critical points, we wish to take advantage of the superposition principle introduced in [6, Theorem 8.2.1]. Accordingly, let us define the probability where we consider the flow Y ε also as a function of time. By definition of Π ε , for every For every bounded continuous function ϕ : [0, T ] × R d × R d → R , we may consider (3.6) also integrated in time. In particular, by Fubini, x(t), u(t)) dt dΠ ε (x 0 , u 0 , x(·), u(·)) . (3.7) We notice in particular that the function from Γ p T × Λ to R (x(·), u(·)) → T 0 ϕ(t, x(t), u(t)) dt is continuous.
Theorem 3.6. There exists Π ∈ P(R d × R d × Γ p T × Λ) such that, up to a subsequence, Π ε narrowly converges to Π . Moreover, every accumulation point Π of Π ε satisfies Proof. By definition, for every ε ∈ (0, 1] the support of Π ε is contained in the subset of where we denote with cl(A) the closure of A . In view of Theorem 2.8 the above set is compact in Γ p T × Λ with respect to the strong topology of L p and the uniform convergence in Λ . Therefore, by Prokhorov theorem there exists a measure Π ∈ P(R d × R d × Γ p T × Λ) such that, up to a subsequence, Π ε narrowly converges to Π. Moreover, for every Taking ϕ(t, x, u) any bounded continuous extension of |∇ x E(t, x) − u| 2 outside K • (see Remark 2.7) as a test function and recalling Proposition 2.6 (c), we get where Y 1 ε and Y 2 denote the two components of the flow Y ε associated to system (2.1). Notice that Y 2 does in fact not depend on ε . Combining (3.9) and (3.10) we deduce (3.8).
Proof. The measure η ε,t ⊗ L 1 |[0,T ] is (up to a rescaling) a probability measure with compact support in R d × R d × [0, T ] independent of ε > 0. In view of the structure of η ε,t ⊗ L 1 |[0,T ] , we have that there exists a Borel family {η t : t ∈ [0, T ]} of measures in P(R d × R d ) such that, up to subsequences, η ε,t ⊗ L 1 narrowly converges to η t ⊗ L |[0,T ] . From equality (3.6) and from Theorem 3.6 we deduce formula (3.11). In a similar way we get (3.12).
As for (3.13), given φ and ϕ as in the statement of the theorem, we simply test the continuity equation (3.5) Passing to the limit as ε → 0 in the previous equality we obtain (3.13).

Mean-field energy balance
While Theorem 3.6 explains that evolutions of critical points are distributed essentially as η t , we would like to clarify in this section how the energy E(t, x(t)) is distributed at time t ∈ [0, T ] given that t → x(t) is an evolution of critical points as derived in Theorem 2.8 with random initial data (x 0 , u 0 ) distributed as η 0 . Ideally we would expect that E(t, x(t), u(t)) is distributed as E(t, ·, ·) # η t . Unfortunately, the lack of smoothness of the trajectories t → x(t) does not allow to obtain such a mean-field description of the energy, but we are able to derive below a slightly weaker form of it, which leverages again the superposition principle and Lebesgue point description of left and right limits.

15)
wherex(·) is a representative of x(·) with left and right limitsx ± (t) at any t ∈ [0, T ], and ν (x(·),u(·)) is the positive measure of Proposition 2.14 (a), Proof. As mentioned in the proof of Lemma 3.9, from Theorem 3.6, for any (x(·), u(·)) ∈ supp(Π * ) there exists a sequence ((x εn (·), u εn (·))) n of solutions of (2.1) converging to (x(·), u(·)) in Γ p T × Λ for any vanishing sequence (ε n ) n . In particular, x εn (t) converges up to subsequences to x(t) for almost every t in [0, T ] . However, from Theorem 2.8 (a)-(c) there exists yet one more subsequence (not relabeled), which converges pointwise to a trajectoryx ∈ Γ p T , possessing left and right limitsx ± (t) at any t ∈ [0, T ] . Hence,x and x coincide almost everywhere. In particular, the following integrals must coincide Therefore, by Theorem 2.8 (d), we have where ν (x(·),u(·)) is the positive measure of Proposition 2.14 (a). From Lemma 3.9 we are allowed to consider the integration of these identities with respect to Π * (or Π), to eventually obtain (3.15). We conclude noticing that, by Carathéodory extension theorem, the integrated measures define a positive Radon measure, which we denote V = Remark 3.11. It would be very tempting to write but, unfortunately, the function t → E (t,x ± (t), u(t)) is only measurable and it would not be possible to obtain such a pointwise identity; a further integration in time may be needed in order to express the identities in terms of integrations with respect to η . Nevertheless, the identities (3.15) hold true pointwise for all 0 ≤ s ≤ t ≤ T ; it is perhaps a bit more abstract energy balance principle as one may have expected, but it is also a quite concise description of the distribution of the energy.

Learning of energies and data-driven evolutions
In this section we focus on the problem of the reconstruction of the energy function E , assuming that we have observed a certain large number N of evolutions x i : [0, T ] → R d , i = 1, . . . , N , obtained as limit of solutions x i ε of the singularly perturbed gradient flow (2.1) as ε → 0 . The energy reconstruction will be recast in terms of a minimum problem of a suitable discrepancy functional J η , very much alike the left-hand-side of (3.12). The functional depends explicitly on a measure η t ⊗ L 1 |[0,T ] , which is limit, along a subsequence, of the measures η N ε,t ⊗ L 1 |[0,T ] as N → ∞ and ε → 0 , where η N ε,t ∈ P(R d × R d ) is the empirical measure centered at the ε -evolutions. In what follows, we propose and analyze a constructive and numerically implementable procedure which allows us to approximate E in a finite dimensional setting up to an arbitrarily small error, in a suitable sense.

Learning as a variational problem
Following the lines of [7], we fix two constants M, R > 0 and we consider the function class Our particular choice of R, M is the following: , and every ε > 0 . In view of hypothesis (E1) of Section 2, we have that E ∈ C 1,1 loc ([0, T ] × R d ), so that to the given R corresponds M = M (R) such that E ∈ X M,R . We notice that R and M do not depend on ε .
Having in mind the numerical implementation, we are interested in computing good approxima-tionsÊ N of E belonging to suitable finite dimensional subset V N of X M,R , N ∈ N . In particular V N is a suitable ball of a finite dimensional subspace of W 2,∞ loc , e.g., a suitable finite element subspace. For the sequence (V N ) N ∈N we make the following uniform approximation assumption. We say that (V N ) N ∈N has the uniform approximation property with respect to η if for everyÊ ∈ X M,R there exists a sequenceÊ N ∈ V N such thatÊ N →Ê in W 1,∞ (supp(η)) as N → ∞.
Remark 4.2. The role played by the measure η will be better clarified in the following discussions. We anticipate here that, roughly speaking, the support ofη represents the region of R d × R d that has been explored by the observed evolutions. For this reasons, it is natural to assume the above uniform approximation property only on supp(η) . In the numerical simulations we will make an extensive use of this property, since we will employ suitable space refinements only on the regions of R d × R d visited by some evolution. We refer to Section 5.2 for further details.
We now introduce the key functionals for our reconstruction procedure. For every N ∈ N and every ε > 0 , let us fix N pairs (x i 0 , u i 0 ) ∈ supp(η 0 ) distributed according to η 0 and let us consider the corresponding solutions (x i ε , u i ε ) : [0, T ] → R d × R d of the ODE system (2.1). As in Section 3, we define the empirical measure η In Proposition 3.4 we have shown that in the limit as N → ∞ the sequence η N ε converges uniformly with respect to W 1 to a curve η ε ∈ C([0, T ]; P(R d × R d )) solution of the continuity equation (3.5).
Accordingly, we define the functional We notice that, with our choice of R and M , supp(η N ε,t ) ∪ supp(η ε,t ) ⊆ B R × B R for every ε , every N , and every t .
Finally, for a Borel family Notice that this functional is simply designed to naturally measure the discrepancy occurring in equation (3.12).
. . , N , be the solutions of the system (2.1). Assume that for i = 1, . . . , N x i ε converges to a quasi-static evolution x i in Γ p T for every 1 ≤ p < ∞ and u i ε converges to u i in Λ . Let us consider the empirical measure η N t : = 1 N N i=1 δ (xi(t),ui(t)) ∈ P(R d × R d ) and set The numerically implementable algorithm to approximate the energy E is based on the following finite dimensional optimization problem:Ê N := arg min E∈V N J N (Ê) (4.8) As (4.8) defines a sequence of variational problems, we wishes to show thatÊ N → E in a suitable sense, by using variational convergence arguments, such as Γ -convergence. As it is a standard notion, we refer to [8,13] for more details.

Approximation by Γ-convergence
Our construction is guided by the following (essentially commutative) diagram of limits: The following results clarify the limits appearing in the diagram. We start from the bottom of the diagram, showing the uniform convergence of J ηε to J η for ε → 0.
Let us now continue by describing the approximation provided by the upper limits of the diagram (4.9).
Proposition 4.5. Let δ > 0 and N ∈ N be given. Let ( T ×Λ, for i = 1, . . . , N , be the solutions of the system (2.1). Assume that for i = 1, . . . , N x i ε converges to a quasi-static evolution x i in Γ p T for every 1 ≤ p < ∞ and u i ε converges to u i in Λ . Let us consider the empirical measure η N Then, there exist ε N > 0 and two positive constants C 1 , C 2 (independent of δ and N ) such that for everyÊ ∈ X M,R and every 0 < ε ≤ ε N J N,ε (Ê) ≤ C 1 (J N (Ê) + δ + ε) and J N (Ê) ≤ C 2 (J N,ε (Ê) + δ + ε) . Remark 4.6. We notice that the hypothesis on the strong convergence of x i ε to x i is not too restrictive in view of the compactness result shown in Theorem 2.8. In fact, as a modeling assumption, we prescribe here that the observed quasi-static evolutions x i are limit of the singularly perturbed dynamic described by (2.1).
Proof of Proposition 4.5. In view of the convergence hypothesis on x i ε , there exists ε N > 0 such that In view of (c) of Proposition 2.6, we have for some positive constant C independent of N and ε . Thus, for everyÊ ∈ X M,R we have that Thanks to our choice of R , we have that Hence, for every t ∈ [0, T ] and every i = 1, . . . , N , In view of the previous estimates, assuming that 0 < ε ≤ ε N we continue in (4.12) with In a similar way we can show the second inequality in (4.11).
In the next proposition we show a uniform estimate of the distance between J N,ε and J η , which explains the central diagonal Γ -limit of (4.9). Proposition 4.7. Let {η t : t ∈ [0, T ]} be a Borel family in P(R d × R d ) with uniformly compact support and such that (4.6) is satisfied. Let η : (4.14) For I 1 , we have that As for I 2 we write where we have used the inequalityη(K) ≤ T . Proof. The Γ -liminf inequality follows directly from Proposition 4.7. In a similar way, the Γ -limsup inequality is a consequence of Proposition 4.7 and of Definition 4.1, which ensures that for everyÊ ∈ X M,R there exists a sequenceÊ N ∈ V N such thatÊ N →Ê in W 1,∞ (supp(η)), whereη is as in (4.2).
In the next two propositions we discuss the convergence of minimizers of the functionals J N,ε , J N to minimizers of J η . Proof. By definition of X M,R a sequence of minimizersÊ N of J N,ε N is weak * -compact in X M,R , and we denote withÊ the weak * -limit of a suitable subsequence ofÊ N . In view of Proposition 4.7, J N,ε N (Ê N ) converges to J η (Ê). From the minimality ofÊ N and the uniform approximation property satisfied by the subspaces V N , we can easily prove thatÊ is a minimizer of J η in X M,R .
Theorem 4.11. Let δ > 0 , ε N > 0, and η N ε N ,t , η N t ∈ P(R d × R d ) be as in Proposition 4.5. Assume that there exists a Borel family  Then, (Ê N ) N converges, up to a subsequence, to someÊ δ ∈ X M,R satisfying for a positive constant C independent of δ . Moreover, there exist a further Borel family {η t : t ∈ [0, T ]} ⊆ P(R d × R d ) and a furtherÊ ∈ X M,R such that, up to a subsequence, η δ converges narrowly to η : Proof. LetÊ N,ε N be a solution of min Pairing the inequalities (4.11) of Proposition 4.5 for J N,ε N and J N we get that (4.21) In particular, the constants C 1 and C 2 do not depend on δ and N . By definition of X M,R , the sequenceÊ N converges, up to a not relabeled subsequence, to someÊ ∈ W 2,∞ ([0, T ]×B R ) in the W 1,∞norm. Up to an extension, we may assumeÊ ∈ X M,R . By Proposition 4.7, J N,ε N (Ê N ) converges to J η δ (Ê) as N → ∞. In view of Proposition 4.10 we have that, up to a further subsequence,Ê N,ε N converges in W 1,∞ ([0, T ] × B R ) to a minimizer of J η δ . Hence, applying again Proposition 4.7, we deduce that J N,ε N (Ê N,ε N ) → 0 . Thus, applying Corollary 4.8 to (4.21) and taking into account the convergence ofÊ N toÊ , we deduce (4.19). The second part of the proposition follows immediately from (4.19), noticing that the measures η δ have uniform compact support in [0, T ] × R d × R d and are therefore compact with respect to narrow convergence.
Remark 4.12. Let us briefly comment on the result of Theorem 4.11. As we noticed in (4.5), by Proposition 3.7 we have that the measure η := η t ⊗ L 1 |[0,T ] is concentrated on the set Therefore, for everyÊ ∈ X M,R we can write Hence, equality (4.20) in Theorem 4.11 can be reformulated as This of course implies that ∇ xÊ (t, x) = ∇ x E(t, x)η -a.e. in [0, T ]×R d , that is, we are able to reconstruct the spatial gradient of the energy function E in the region of [0, T ] × R d that has been explored by the quasi-static evolutions x i (·) which, when the number of observed evolutions N is very large, is well approximated by the support of the measureη . This is indeed a natural constraint, since we have no information about the region of [0, T ] × R d that has not been explored by a quasi-static evolution. Even if the measureη is pretty much resulting from an abstract construction, since it has been obtained by applying a compactness argument to the sequence of measures η δ constructed in the first part of Theorem 4.11, we anyway claim that our approach to energy reconstruction is entirely constructive and numerically implementable. Let us briefly explain why. In the last part of Theorem 4.11 we have shown that η is the limit as δ → 0 of the sequence η δ := η δ t ⊗ L 1 |[0,T ] . The measure η δ satisfies This implies that ∇ xÊδ is itself a good approximation of ∇ x E in the space L 2 ([0, T ]×R d ,η δ ). Moreover, the first part of Theorem 4.11 gives us another important piece of information: η δ andÊ δ can be approximated in a "finite dimensional-finite number of evolutions" setting in which, indeed, we work only with a finite number N of observed evolutions, which completely determine the empirical measure η N ε N ,t , and we solve the minimum problem (4.18) on a suitable finite dimensional subspace V N of X M,R . Remark 4.13. We note that we could also obtain a result similar to Theorem 4.11 in nature, in which the roles of J N,ε N and J N are exchanged. Let δ > 0 and ε N be as in Proposition 4.5, and assume that the sequence of measures η N := η N t ⊗ L 1 |[0,T ] converges narrowly to η δ := η δ t ⊗ L 1 |[0,T ] and that (V N ) N ∈N satisfies the uniform approximation property with respect to η δ . Then, denoted withÊ N ∈ V N a solution of min we have that, up to a subsequence,Ê N converges to someÊ δ ∈ X M,R satisfying for a positive constant C independent of δ . Moreover, in the limit as δ → 0 , η δ converges narrowly to some η : The proof of this result is still based on the arguments of Propositions 4.5 and 4.7.
Such an approximation result will be used as a practical proxy for (4.18) in the numerical experiments in Section 5.2.

Data-driven evolutions of critical points
In the previous section we obtained compactness results, which explain the approximation of the energy E by data-driven energiesÊ ∈ W 2,∞ ([0, T ] × B R ) and, for δ > 0,Ê δ ∈ W 2,∞ ([0, T ] × B R ) constructed in Theorem 4.11. In this section we show pointwise error estimates on the singularly perturbed evolutions generated using the data-driven energiesÊ N ,Ê δ , andÊ , with respect to evolutions of critical points of the original energy E .
Corollary 4.14. LetÊ ,Ê N , and η be as in Theorem 4.11, and letη ∈ M + b ([0, T ] × R d ) be defined as in (4.2). Then, for every ε > 0 there exists N = N (ε) ∈ N large enough such that the solution (x N ε , u ε ) of the system (2.1) with initial condition (x 0 , u 0 ) and energyÊ N fulfills the error estimate where (x ε , u ε ) denotes the solution of (2.1) with initial condition (x 0 , u 0 ) and energy E and dist(·, K) is the usual distance from a set K ⊆ R d .
Remark 4.15. Let us comment on formula (4.22). Although the error estimate does not a priori guarantee the data-driven evolutionx N ε to be close to x ε , we anyway expect it to happen in most of the applications. Indeed, as shown in the numerical experiments in Section 5.2, increasing the number of observed evolutions results in the enlargement of supp(η). This means that even if, according to Theorem 2.8, the distance of x ε from a quasi-static evolution of critical points t → x(t) has no clear rate of convergence, the distance of x ε from the whole supp(η) , union of the orbits of all the observed quasi-static evolutions, can be expected to satisfy the condition dist(x ε (·), supp(η)) L 1 (0,T ) = 0 . In particular, we refer to Section 5.2 for some numerical examples of data-driven evolutions. Here, we conclude by noticing that (4.22)-(4.23) imply thatx ε −x ε tends to zero uniformly in [0, T ]. Therefore, we deduce from Theorem 2.8 that, along a suitable subsequence ε n → 0,x Nn εn → x in L p (0, T ) for p < +∞ , where N n = N (ε n ) . Finally, in the particular case dist(x ε (·), supp(η)) L 1 (0,T ) = 0 we even have that Proof of Corollary 4.14. LetÊ δ ∈ W 1,∞ ([0, T ] × B R ) be as in (4.19) of Theorem 4.11, so thatÊ δ →Ê as δ → 0 andÊ N →Ê δ for N → ∞ in W 1,∞ ([0, T ] × B R ). In particular, for every ε > 0 there exist δ > 0 small enough and N large enough such that where y : [0, 1] → R is a displacement map and E 0 is a suitable nonlinear elastic energy, fully determined by a potential function a : R → R . We consider here quasi-static evolutions y : [0, T ] × [0, 1] → R of critical points y(t) of E 0 subjected to time-dependent boundary conditions y(t, i) = f i (t) , for i ∈ {0, 1} , their mean-field descriptions, and the learning of the potential function a . As the theory we developed in this paper applies to finite dimensional states, see Section 1.1, we approach the problem in a space-discrete setting.
To simplify the notation, we introduce the discrete gradient operator D : R d+2 → R d+1 defined by With this notation at hand, we can rewrite E(t, x) as We also write explicitly the expression of ∇ x E(t, x): {i=1,...,d+1} . We notice that ker(D T ) = (1, . . . , 1) ⊆ R d+1 . We finally point out that for the control parameter u in (2.1) we will not consider any dynamics, that is, we fix f ≡ 0 in (1.5) and u(t) will be constantly equal to its initial value u 0 := ∇ x E(0, x 0 ) .
In order to apply the abstract scheme developed in the previous sections, we have first to check that the energy function (5.3) satisfies properties (E1)-(E4). In the following lemma we rigorously show that E fulfills (E1)-(E3), while in Remark 5.2 we discuss the generic validity of condition (E4). Proof. Property (E1) is clearly satisfied in view of (a1) and of the regularity of f 1 and f 2 .
By (a2) and by regularity of f 1 and f 2 , we have that Since a ≥ 0 , we can continue the previous estimate with for some positive constant C independent of t and x . Thus, (E2) holds.
As for (E3), by (a3) we have that By convexity of the function s → |s| p for p > 1 and by Young inequality we get A similar inequality holds for |f 2 (t) − x d | p . Hence, (5.5) becomes At this point, it is easy to see that there exists a positive constant c such that In fact, for j = 1 and j = d the inequality is obvious. For 1 < j < d we notice that This concludes the proof of (E3).
Remark 5.2. Let us comment on the validity of property (E4). In the framework described above, as we are assuming f = 0 in (2.1), it is actually enough to have that C(t, u) contains only isolated points for every u ∈ R d with u = ∇ x E(0, x 0 ), The validity of (5.6) is related to the so called transversality conditions (see, e.g., [2,30]). Indeed, in [2] the authors first show that the transversality conditions for an energy E imply that the set of critical points C(t):= {x ∈ R d : ∇ x E(t, x) = 0} contains only isolated points. In [2, Theorem 1.3] (see also [2,Corollary 3.7]) they also prove the genericity of the transversality conditions. In our setting, the latter result states that, assuming a ∈ C 4 (R) and f 1 , f 2 ∈ C 4 (0, T ) , there exists a set N ⊆ R d × R d×d of Lebesgue measure zero such that for every Ax · x satisfies the transversality conditions, so that the set {x ∈ R d : ∇ x E(t, x) = 0} contains only isolated points.
In the present work we have been considering the energy E(t, x, u) : = E(t, x) − u · x , which already modifies E by the additive linear term −u · x , where u: = ∇ x E(0, x 0 ) , x 0 being the initial condition of x(·) . Hence, assuming that the distribution of x 0 has a non-degenerate (say, for instance, of positive Lebesgue measure) support and ∇ x E(0, ·) is non-degenerate, we deduce that condition (5.6) is in general satisfied, up to a further generic quadratic perturbation of E .
In view of Lemma 5.1 and of Remark 5.2, from now on we will assume that E in (5.3) satisfies properties (E1)-(E4). Hence, we can apply the theoretical results of Section 4 to our energy E in (5.3). Since E is completely determined by the monovariate function a, we slightly modify the notation of Section 4 to this new setting, rewriting the identification problem in terms of a . In fact, while approximating a highdimensional (multivariate) function E directly incurs in the curse of dimensionality [24] in general, our model is actually parametrized by a lower dimensional function a , making the learnability/approximation problem computationally tractable. Of course, this imposes a further modeling constraint. Accordingly, for fixed M, R > 0 , instead of the space X M,R in (4.1), we consider The choice of M, R > 0 can be performed similarly to Section 4, simply noticing that the boundedness of x implies the boundedness of D e t (x) , with a bound that depends on the boundary data f 1 (t) and f 2 (t).
We consider a sequence (V N ) N ∈N of finite dimensional subspaces of A M,R , for which the uniform approximation property of Definition 4.1 reads now as follows. and let where π i : R d+1 → R stands for the projection on the i -th component. We say that (V N ) N ∈N has the uniform approximation property with respect to η if for everyâ ∈ A M,R there exists a sequenceâ We now rewrite the functionals (4.3)-(4.5) in terms ofâ, a ∈ A M,R making use of formula (5.4). As already mentioned, here we consider time independent controls u = ∇ x E(0, x 0 ) . Hence, given a distribution µ 0 ∈ P c (R d ) of initial conditions x 0 , the corresponding distribution of (x 0 , u 0 ) reads as η 0 := (id× ∇ x E(0, ·)) # µ 0 ∈ P c (R d ×R d ) . For every N ∈ N and every ε > 0 we fix N pairs (x i 0 , u i 0 ) ∈ supp(η 0 ) distributed according to η 0 and we consider the corresponding solutions (x i ε , u i ε ) : [0, T ] → R d × R d of the ODE system (2.1). Given the empirical measure η N ε,t : = 1 In the limit as N → ∞ the sequence η N ε converges uniformly with respect to W 1 to a curve η ε ∈ C([0, T ]; P(R d × R d )). Therefore, for everyâ ∈ A M,R we set For a Borel family {η t : t ∈ [0, T ]} ⊆ P(R d × R d ) and for everyâ ∈ A M,R we set then we can also express J η in the equivalent form We now adapt the main results of Section 4, namely, Proposition 4.7 and Theorem 4.11.
for some positive constants D 1 , D 2 depending on D , d , M , T , f 1 , and f 2 .
Following the lines of the proof of Proposition 4.7 and using (5.4), we get where D 1 = D 1 ( D , d, M, T ) > 0 .
As for I 2 we write This concludes the proof of the proposition.
Proof. It is enough to follow step by step the proof of Theorem 4.11 taking into account that the results of Proposition 4.5 still hold in the present framework.

Numerical results
In this section we present numerical experiments, which show the practical efficiency of the optimization (5.11) in recovering the potential function a from observation of a finite number of evolutions of critical points. In particular, we highlight some practical issues and the impact of various parameters of the problem on the reconstructions. First of all, we recast the problem in a discrete and numerically efficient implementation. Afterwards, we focus on how the available information -corresponding to the number of experiments or measurements per experiment -impacts the quality of reconstructions. Finally, we show that the choice of the constant M as in A M,R is in a sense generic, as sufficiently large M (for other parameters fixed) allows for appropriate reconstructions. Finally, we compare simulations of data-driven evolutions generated by the empiricalâ with those generated by the true potential a. We show the remarkable accordance of the results.

Efficient numerically implementable formulation
The following experiments are realized by a common numerical implementation and are applied to the toy mechanical example of Section 5. As space of competitors, we consider V Λ := â ∈ A M,R |â is piecewise quadratic on a given grid Λ .
(5. 16) We observe measurements at times 0 = t 0 < · · · < t Ne = T with stepsizes ∆ m = (t m+1 − t m−1 )/2 and gridpoints p 1 < · · · < p K of Λ with stepsizes∆ k = p k+1 − p k . For an appropriate increasing sequence of grids Λ : = Λ(N ), the corresponding sequence of spaces V N : = V Λ(N ) has the uniform approximation property on compact sets. We consider an initial data distribution µ N 0 drawn from a d -dimensional normal distributions with uniform standard deviation. For any initial data x i 0 in the support of µ N 0 , we solve the system (1.4) for trajectories x i ε for fixed ε > 0. As time-discrete approximation of the energy functional J N,ε in (5.9) we consider obtained by replacing the integral in time with a sum of point evaluations, which would correspond to assuming solutions, control, and boundary conditions to be piecewise constant in time. We assume that the arguments ofâ as in (5.17) are distributed according to a discrete version ofη in Definition 5.3, which encodes the available information to recover a . As previously stated, V N needs to be designed in order to approximate A M,R . In order to provide additionally a form of numerical stabilization and preconditioning, we choose the grid Λ adaptively with respect to the distributionη . In particular we consider denser meshes in regions of the support whereη has large density and coarser grids in regions of low density, thereby exploring the entire support ofη .
As functional defined in (5.17) solely depends on derivativesâ , we seek forâ ∈ V N which consists of piecewise linear functions such thatâ ∈ V N . In particular we consider the expansion where {φ λ : λ = 1, . . . , D(N )} is a set of suitable basis functions of V N , and a := (â 1 , . . . ,â D(N ) ) denotes the corresponding coefficient vector. From this information it is immediate by integration to identify a up to additive constants on connected components of the support ofη . Note, however, that it is not possible to relate additive constants at different connected components of the support ofη . In case the initial distribution has connected support, since the forward processes are continuous, also the support ofη consists of connected components (at most one for each time t m ). Thus, one can assume that for a sufficient amount of experiments and connected support of the initial distribution there are only few connected components. It should be noted thatJ N,ε (â) withâ written as above can be written as a quadratic functional Here the data vector Y corresponds to Therefore the assembly of M corresponds to formulation of the interpolation matrix B i,m and the "componentwise" application of D T and can be done iteratively. In particular, the system is sparse with at most 4 entries per row, and thus the approach can be applied even with a large number of measurements and experiments.
Minimizing the functionJ N,ε (â) over V N is a quadratic optimization problem. However, we require the ansatz to be conform, i.e., ensure the inclusion V N ⊂ A M,R . This constraint requires â W 2,∞ (I R ) ≤ M forâ ∈ V N . To enforce this constraint numerically, note that a and a are bounded by the maximal and minimal values of the corresponding coefficient vectors a and a as follows: One considers the gradient operator corresponding to the grid Λ , so that a = D Λ a is the coefficient vector of the piecewise constant function corresponding to a . Combining (5.19) and (5.21), we can consider as a discrete version of the reconstruction problem, Note that allowing two different bounds M 1 and M 2 offers more flexibility, while serving the same purpose in the theoretical setting of creating compactness. In particular this allows to target a more specifically by stricter bounds in order to avoid oscillatory behavior. Moreover, note that due to the structure of M, a can only be reconstructed up to a constant vector, meaningâ can only be determined up to constant analogously to the considerations leading to (5.15).
For the sake of simplicity we further restrict the optimization to competitors withâ (0) = 0 and we assume that a (0) = 0 as well.
Moreover, we notice that, because of discretizations in time inJ N,ε and the use of non equivalent constraints, the minimizer of (5.22) does not precisely coincide with the minimizer of the original minimization problem. It is however reasonable to think that it indeed approximates the true solution of J N,ε , which in turn approximates the true energy due to Γ -approximation.
Being (5.22) a least squares problem with norm constraints, a variety of optimization algorithms are applicable. For the results presented in this work we used the CVX toolbox [19,18], which is well suited as all functions and operations can be written as convex functions and constraints.

Linear elasticity -a trivial example
We start with the standard potential a(y) = y 2 2 , which is considered in the context of linear elasticity. As this potential is uniformly convex and contained in V N , one expects the reconstruction to work better compared to more complex potentials. Figure 1 depicts the approximation of a and a for a quadratic potential, and shows very accurate approximation. The approximation of a appears almost exact everywhere, while the approximation of a loses accuracy at the boundaries of the observed interval, due to summation of minor (systematic) errors. Nonetheless a is overall best approximated in regions whereη has higher density. Thus elastic potentials can be identified very well using this approach. However, for more complex potentials one may not expect to obtain always such a good reconstruction, and in the following sections we consider the impact of various parameters on the reconstruction of non-quadratic potentials.

Impact of the amount of information on the reconstruction quality
From a numerical perspective, solving (5.22) is a least squares problem, and with more available information, one would expect to increase the reliability of reconstructions. This amount of information in our setting mainly depends on two factors -the amount of measurements N e made for every experiment and the number N of observed experiments. Thus, we want to demonstrate the effect of increasing amount of information, in particular verifying that for sufficient amount of information one can accurately recover solutions, while too little information yields unstable recovery.
In Figure 2 all parameters of the reconstruction are fixed except for the amount of measurements N e . One observes that results get more reliable and noise is reduced with an increasing number N e of measurements and for a sufficient amount of information one can precisely reconstruct a on the support ofη , whose density is approximately depicted below by the histogram of observed information. In the reconstructions with little information, the solutions are oscillating, and, although following the overall trend of the true energy functional, do not perfectly capture its behavior. One can also see that in regions without information, and correspondingly with no or few nodes, the approximation is crude and can not be trusted, but in regions with much information a rather accurate reconstruction can be found. Of course, in many practical applications the amount of sampling in time might be limited by technical limitations. The resulting issue of lack of information can be offset by considering a larger number of experiments. In fact the main result of the provided theory is that for N → ∞ -considering ever more experiments -one can reconstruct a increasingly well. In order to approximate with V N the space A M,R in the sense of the uniform approximation property, it is necessary to increase adaptively the amount of nodes D(N ) . A trivial way to do this is considering a linear relation between N and the amount of nodes D(N ), i.e., D(N ) N . Figure 3 shows that an increased number of experiments and nodes improves significantly the quality of the reconstruction.
In comparison, we show that the improved reconstruction is not solely the result of a finer grid, but rather a consequence of more available information. Therefore, we considered in Figure 4 the same experiment as in Figure 3, but with the number of nodes D(N ) = 300 independent of N . While the reconstruction even for a single experiment is not particularly bad, one can see that it is not very smooth, representing a smaller degree of confidence in the solutions. For increasing number N of experiments, the results become smoother. Moreover, regions not significantly visited by a single experiment, and therefore not well supported by the mesh (e.g., left side of the plot), get better represented for a larger number of experiments as they might get explored more thoroughly. However, note that there appear to be regions which does not get -or does get very rarely -visited independently of the number N of experiments as the density ofη is zero or very small there, and therefore no reasonable reconstruction can be obtained at such locations.

Suitable W 2,∞ constraints
The theory in this work does not yet provide a method for choosing M (or M 1 , M 2 in our numerical model). It is clear that M 1 and M 2 too small will significantly limit the available class of competitors, and therefore one can not expect to capture the true a if M 1 and M 2 are much smaller than a ∞ and a ∞ , respectively. On the other hand, M 1 and M 2 finite is necessary to ensure compactness from a theoretical perspective, so it is not obvious what the impact of too large M 1 and M 2 is. However, for suitable data, one would expect that for M 1 , M 2 >M sufficiently large have no real impact on the reconstruction. Figure 5 depicts the effect of different constraints M 2 on a ∞ . One can see that for too small M 2 the reconstruction follows the overall trend, but can not replicate local fluctuations, with more detail captured by increasing M 2 . In particular, note that there is no difference between solving with M 2 = 20 and M 2 = 1000 since a ∞ ≤ 20 , where a is the derivative of the true solution of the minimization problem (5.22) without constraints and therefore the constraint has no effect. We further stress that this does not imply that the constraint M 2 is irrelevant, as for poor or incomplete data the least squares problem can become highly unstable (e.g., due to overfitting), and constraints can limit this effect.
On the other hand, the constraint M 1 bounds the overall values of a . For too small M 1 the reconstruction corresponds to a projection of the true energy function to the corresponding bound. This effect can be observed in Figure 5.

Data-driven evolutions
Given a, ε, x 0 , u 0 , and f , the system (1.5) can be solved to generate the evolution of x ε . While we created or observed evolutions generated by the true a and used these trajectories in the previous sections to identify a, a practical reason for determiningâ ≈ a is that this in turn can be used for simulations of system (1.5), e.g. instead of performing further real-life experiments. This section discusses the quality of such numerical simulation showing that indeed suitable evolutions can be replicated, as theoretically analyzed in Section 4.3.
We start by considering the situation with linear elastic potential a(y) = 1 2 y 2 discussed in Section 5.2.2, where high fidelity approximation of a byâ is achieved. The left side of Figure 7 depicts the corresponding trajectoriesx ε and x ε generated byâ and a, respectively, with an initial datum (x 0 , u 0 ) taken from the distribution η 0 . These trajectories are basically identical, which is to be expected in view of the good reconstructionâ of a , and since we chose the initial data from µ 0 , and the correspondingη is supported on a sufficiently large domain.
When considering the more challenging example with highly nonlinear potentials of Section 5.2.3 for N e = 55 , N = 60 and D(N ) = 4N , the recovery is slightly less precise, in particular since the support ofη is no longer connected in this case. On the right of Figure 7 we show that indeed in this situation we can not perfectly replicate the evolutions, nonetheless the overall behavior of the trajectory remains intelligible.
The  are distant from the support ofη . Thereâ is not reliable creating further errors. However, the resulting trajectories appear quite acceptable and in particular they show no extreme outliers where values may diverge or act too wildly.
In summary, the presented simulations confirm the theoretical findings about the robust recovery of various potentials a or a from observations of evolutions of critical points. The reconstructionsâ are such that further simulations of trajectories are faithful.