Probabilistic Learning on Manifolds

This paper presents mathematical results in support of the methodology of probabilistic learning on manifolds (PLoM), recently introduced by the authors, which has been used with success for analyzing complex engineering systems. The PLoM considers a given initial dataset constituted of a small number of points in a Euclidean space, which are interpreted as independent realizations of a vector-valued random variable whose non-Gaussian probability measure is unknown but is, a priori, concentrated in an unknown subset of the Euclidean space. The objective is to construct a learned dataset constituted of additional realizations that allow the evaluation of converged statistics. A transport of the probability measure estimated with the initial dataset is performed through a linear transformation constructed using a reduced-order diffusion-maps basis. In this paper, it is proven that this transported measure is a marginal distribution of the invariant measure of a reduced-order Itô stochastic differential equation that corresponds to a dissipative Hamiltonian dynamical system. This construction allows for preserving the concentration of the probability measure. This property is shown by analyzing a distance between the random matrix constructed with the PLoM and the matrix representing the initial dataset, as a function of the dimension of the basis. It is further proven that this distance has a minimum for a dimension of the reduced-order diffusion-maps basis that is strictly smaller than the number of points in the initial dataset. Finally, a brief numerical application illustrates the mathematical results.


Introduction
In this paper, mathematical results are presented for justifying and clarifying the methodology of probabilistic learning on manifolds (PLoM), initially introduced in [1], completed in [2], extended to the case of the polynomial chaos representation [3], to the case of learning in the presence of physics constraints [4], and to the sampling of Bayesian posteriors [5]. The PLoM has been used with success for probabilistic nonconvex optimization under constraints and uncertainties [6], which has allowed for analyzing very complex engineering systems such as hypersonic combustion flows [7,8], the optimal placement of wells [9], the design optimization under uncertainties of mesoscale implants [10], the quantification of model uncertainties in nonlinear computational dynamics [11], and the fracture paths in random composites [12].
The proposed PLoM, which can be viewed either as a supervised or an unsupervised machine-learning method, considers a given initial dataset constituted of N given points η^1_d, ..., η^N_d in R^ν, which are interpreted as independent realizations of an R^ν-valued random variable H whose non-Gaussian probability measure p_H(η) dη on R^ν is unknown but is, a priori, concentrated in an unknown subset of R^ν. Denoting by p^(N)_H(η) dη the nonparametric estimate of p_H(η) dη constructed with the N points, the sequence {p^(N)_H(η) dη}_N on R^ν converges to p_H(η) dη for N → +∞. In the PLoM, dimension N is fixed and is presumed to be relatively small (the case for which only small data are available, as opposed to the big-data case). Nevertheless, it is assumed that N is larger than some lower bound N_0 needed for p^(N)_H(η) dη to be a sufficiently accurate estimate of p_H(η) dη. Let us now define the random vector H^(N) such that its probability measure is p^(N)_H(η) dη. It should be noted that, for m = N, the value d²_N(N) represents the distance of the random matrix [H^N_N] to the initial dataset [η_d], for which the learned dataset would be generated without using the PLoM and, consequently, would involve a scattering of the generated realizations corresponding to a loss of concentration.
In the formulation proposed and analyzed, N is fixed. There is a priori no sense in studying the convergence of the distance for N going towards infinity, on the one hand because N has a limited value that is supposed to be rather small and on the other hand because N also represents the number of columns of the random matrix [H N ]. However, for a fixed small value of N, the convergence of the probabilistic learning with respect to N can be considered by introducing an ordered subset of integers N 1 < N 2 < . . . < N imax with N 1 > 1 and N imax = N, and by studying the convergence as a function of N i for i = 1, . . . , i max . If the convergence is reached for i ≤ i max , then the learning process is successful; if not, this means that the value of N is too small and has to be increased, that is to say, by increasing the number of points in the initial dataset. This question is outside the scope of this paper and we refer the reader to the references given in the first paragraph of this introduction, references in which this question is dealt with.

Framework and objective of the PLoM
Probabilistic learning is a way of improving the knowledge that one has from only a small number of expensive evaluations of a computational model, in order to be able to solve, for instance, a nonconvex optimization problem under uncertainties with nonlinear constraints, for which a large number of expensive evaluations would otherwise be required. To that end, statistical and probabilistic learning methods have been extensively developed and play an increasingly important role in computational physics and engineering science. In large-scale model-driven design optimization under uncertainty, and more generally in artificial intelligence for extracting information from big data, statistical learning methods have been developed in the form of surrogate models that can be evaluated cheaply, such as Gaussian-process surrogate models, Bayesian calibration methods, active learning, and, when large databases are available, neural networks, generative adversarial networks, etc. All these methodologies allow for decreasing the numerical cost of the evaluations of expensive functions. This is particularly crucial for the evaluation of statistical estimates of large-scale stochastic computational models, a major challenge that requires the use of suitable mathematical methods and algorithms. The probabilistic learning on manifolds, which is analyzed in this paper, is a contribution towards addressing this challenge.
In the framework of supervised machine learning, a typical problem for the use of the PLoM is the following. Let (w, u) → f(w, u) be any measurable mapping on R^{n_w} × R^{n_u} with values in R^{n_q}, representing a computational model coming, for instance, from the discretization of a boundary value problem, in which n_w, n_u, and n_q are any finite integers. Let W and U be two independent (non-Gaussian) random variables defined on a probability space (Θ, T, P) with values in R^{n_w} and R^{n_u}, for which the probability measures P_W(dw) = p_W(w) dw and P_U(du) = p_U(u) du are defined by the probability density functions p_W and p_U with respect to the Lebesgue measures dw and du on R^{n_w} and R^{n_u}. Random vector W is made up of the part of the random parameters of the computational model that is used for controlling the system, while random vector U is made up of the other part of these random parameters, which is not used for controlling the system. Let Q be the quantity of interest (QoI), that is, a random variable defined on (Θ, T, P) with values in R^{n_q} such that

Q = f(W, U). (1)

Let us assume that N calculations have been performed with the computational model (the training) whose solution is represented by equation (1), allowing N independent realizations {q^j, j = 1, ..., N} of Q to be computed such that q^j = f(w^j, u^j), in which {w^j, j = 1, ..., N} and {u^j, j = 1, ..., N} are N independent realizations of (W, U), which have been generated using an adapted generator for p_W and p_U. We then consider the random variable X with values in R^n such that

X = (Q, W), n = n_q + n_w. (2)
The probabilistic learning is performed for X. The initial dataset D_N related to random vector X is then made up of the N independent realizations {x^j, j = 1, ..., N}, in which x^j = (q^j, w^j) ∈ R^n. In this paper, it is assumed that the measurable mapping f is such that the non-Gaussian probability measure P_X(dx) of X = (Q, W) admits a density p_X(x) with respect to the Lebesgue measure dx on R^n. The probability measure of X is unknown and is assumed to be concentrated in a subset of R^n that is also unknown (this concentration property is due to equations (1) and (2)). The objective of the PLoM proposed in [1] is to construct a probabilistic model of non-Gaussian random vector X using only the initial dataset D_N, which allows for generating ν_sim ≫ N additional independent realizations {x^1_ar, ..., x^{ν_sim}_ar} in R^n of random vector X, preserving the concentration of its probability measure and without using the computational model. One can then deduce ν_sim additional realizations {(q^ℓ_ar, w^ℓ_ar), ℓ = 1, ..., ν_sim} such that (q^ℓ_ar, w^ℓ_ar) = x^ℓ_ar. These additional realizations allow, for instance, a cost function J(w) = E{J(Q, W) | W = w} to be evaluated, in which (q, w) → J(q, w) is a given measurable real-valued mapping on R^{n_q} × R^{n_w}, as well as the constraints related to a nonconvex optimization problem [6,9,10,7], and this without calling the computational model. The objective of the paper is to present mathematical developments of the PLoM presented in [1], which allow for justifying the methodology and providing a foundation for further developments. The main steps of the PLoM methodology are not summarized here but will be recalled as the mathematical properties are established.
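As an illustration, the construction of the initial dataset D_N can be sketched as follows; the mapping f below is a hypothetical toy stand-in for the expensive computational model (the paper only assumes f is measurable), and the dimensions n_w, n_u, n_q, the sampling distributions, and N are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_w, n_u, n_q = 2, 3, 4   # hypothetical dimensions of W, U, Q
N = 100                   # number of expensive model evaluations (the training)

def f(w, u):
    # hypothetical toy stand-in for the computational model (e.g., a BVP solver)
    z = np.concatenate([w, u])
    return np.array([np.sin(z).sum(), z @ z, np.prod(np.cos(z)), z.sum() ** 2])

W = rng.normal(size=(N, n_w))   # independent realizations of W (control part)
U = rng.normal(size=(N, n_u))   # independent realizations of U (uncontrolled part)
Q = np.array([f(w, u) for w, u in zip(W, U)])   # q^j = f(w^j, u^j)

X = np.hstack([Q, W])           # x^j = (q^j, w^j) in R^n, with n = n_q + n_w
```

Each row of X is one realization x^j of the random vector X = (Q, W) on which the probabilistic learning is performed.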

Organization of the paper and what are the main results
In Section 2, we introduce the R^ν-valued random variable H resulting from the principal component analysis (PCA) of the R^n-valued random variable X, with ν ≤ n. Section 3 is devoted to the nonparametric statistical estimate p^(N)_H of the pdf of H, and Section 4 to the definition of the random matrix [H_N] and its pdf. In Section 5, we present the construction of the reduced-order diffusion-maps basis [g_m] that is used by the PLoM method and we introduce the estimation of the optimal values ε_opt and m_opt of the hyperparameter ε_DM and of the reduced order m. In particular, we compare the hyperparameter ε_DM with the modified Silverman bandwidth ŝ; we conclude that the invariant probability measure p_{ε_DM}(i) of the Markov chain allowing the diffusion-maps basis to be constructed is different from the probability measure p^(N)_H(η) dη of random vector H^(N) that is considered by the PLoM. Section 6 is devoted to the construction of the probability measure and its generator related to the probabilistic learning on manifolds. We introduce the reduced-order representation [H_N] = [Z_m] [g_m]^T of random matrix [H_N]. We also prove in Proposition 2 that p_{[Z_m]} has a "Gaussian representation", which is a linear combination of N^N products of ν Gaussian pdfs on R^N. Consequently, the use of Theorem 3 effectively allows realizations of random matrix [Z_m] to be generated, while a Gaussian generator that would be based on the Gaussian representation is unthinkable for N > 10, for instance. Section 7 deals with the square of the relative distance d²_N(m) of the random matrix constructed with the PLoM to the initial dataset; the first central result shows that the case m = N corresponds to a generation without using the reduced-order basis (Lemma 2) and induces a loss of concentration of the probability measure. Under a "reasonable hypothesis", the second central result, Theorem 4, proves that d²_N(m_opt) ≪ d²_N(N), in which m_opt < N is the optimal value of m. This result demonstrates that the PLoM method is better than the usual one because it keeps the concentration of the measure. In Section 8, we present a justification of the hypothesis introduced in Theorem 4, based on the use of the maximum entropy principle from Information Theory.
Section 9 is devoted to a brief numerical application that illustrates the mathematical results. The conclusions follow in Section 10.

Notations
The following notations are used:
x: lower-case Latin or Greek letters are deterministic real variables.
x: boldface lower-case Latin or Greek letters are deterministic vectors.
X: upper-case Latin or Greek letters are real-valued random variables.
X: boldface upper-case Latin or Greek letters are vector-valued random variables.
[x]: lower-case Latin or Greek letters between brackets are deterministic matrices.
[X]: boldface upper-case letters between brackets are matrix-valued random variables.
N, R: set of all integers {0, 1, 2, ...}; set of all real numbers.
R^n: Euclidean vector space on R of dimension n.
x = (x_1, ..., x_n): point in R^n.
<x, y> = x_1 y_1 + ... + x_n y_n: inner product in R^n.
‖x‖: norm in R^n such that ‖x‖² = <x, x>.

De-correlation and normalization of random vector X by PCA
In practice, the initial dataset D_N results from a scaling of the available data. Then, a principal component analysis (PCA) is carried out in order to statistically condition the scaled initial dataset D_N through de-correlation and normalization. Let x̂ ∈ R^n and [Ĉ] ∈ M^{+0}_n be the classical empirical estimates of the mean vector and the covariance matrix of X, constructed using D_N (scaled). Let [μ̂] ∈ M_ν be the diagonal matrix of the first ν eigenvalues μ_1 ≥ μ_2 ≥ ... ≥ μ_ν > 0 of [Ĉ] and let [Φ̂] ∈ M_{n,ν} be the matrix of the associated orthonormal eigenvectors. For any fixed ε > 0, ν ≤ n is chosen such that err_PCA(ν) = 1 − (μ_1 + ... + μ_ν)/(Tr [Ĉ]) ≤ ε. This PCA allows for representing X by

X = x̂ + [Φ̂] [μ̂]^{1/2} H. (3)

Throughout this paper, it will be assumed that ν < N. The N independent realizations {η^j_d, j = 1, ..., N} of the second-order R^ν-valued random variable H defined on probability space (Θ, T, P) are such that

η^j_d = [μ̂]^{−1/2} [Φ̂]^T (x^j − x̂), j = 1, ..., N. (4)

Let the empirical mean and covariance estimates of these points satisfy (1/N) Σ_{j=1}^N η^j_d = 0 and [C_N] = (1/(N−1)) Σ_{j=1}^N η^j_d (η^j_d)^T = [I_ν]. (5)

It can be seen that the Frobenius norm ‖η_d‖ of matrix [η_d] ∈ M_{ν,N} is such that

‖η_d‖² = ν (N − 1). (6)

Nonparametric estimate of the pdf of H

As proposed in [13,1], a modification of the multidimensional Gaussian kernel-density estimation method [14,15,16,17] is used for constructing the estimate p^(N)_H on R^ν of the pdf p_H of random vector H, which is written, ∀η ∈ R^ν, as

p^(N)_H(η) = (1/N) Σ_{j=1}^N (1/(√(2π) ŝ))^ν exp{ −(1/(2 ŝ²)) ‖ (ŝ/s) η^j_d − η ‖² }, (7)

s = {4/(N (2 + ν))}^{1/(ν+4)}, ŝ = s / (s² + (N−1)/N)^{1/2}, (8)

in which s is the usual Silverman bandwidth (since [C_N] = [I_ν]) (see for instance [18]) and where ŝ has been introduced in order that ∫_{R^ν} η p^(N)_H(η) dη = 0 and ∫_{R^ν} η η^T p^(N)_H(η) dη = [I_ν], because, in the framework of the PLoM, we need to preserve the centering and the orthogonality property. Finally, for fixed ν, we have the following theorem.

Theorem 1 (Sequence of estimators of p_H(η) [19]). Let us assume that p_H is continuous on R^ν. For ν fixed and for η given in R^ν, let P^(N)(η) be the estimator obtained by substituting the N independent copies H^1, ..., H^N of H for η^1_d, ..., η^N_d in equation (7); P^(N)(η) is a positive-valued random variable.
Thus, ∀η ∈ R^ν, the mean value P̄^(N)(η) = E{P^(N)(η)} and the variance Var{P^(N)(η)} = E{(P^(N)(η) − P̄^(N)(η))²} of P^(N)(η) are such that lim_{N→+∞} P̄^(N)(η) = p_H(η) and Var{P^(N)(η)} ≤ N^{−4/(ν+4)} β_{ν,N} P̄^(N)(η), in which, for ν fixed and for N → +∞, the positive constant β_{ν,N} is such that β_{ν,N} ∼ (2π)^{−ν/2} ((2 + ν)/4)^{ν/(ν+4)}.
PROOF. The proof, inspired by [19], is adapted to the modification of s into ŝ used for defining the estimator. Since p_H is assumed to be a continuous function, ∀η ∈ R^ν, p_H(η) = E{δ_{0_ν}(H − η)}. Using the second equation (7), for all η and η' in R^ν, we consider π_{ν,N}((ŝ/s) η' − η). Using equation (9) yields the corresponding equality in the space of the bounded measures on R^ν. Using the last two equations yields the expression for the mean. Similarly, one computes E{(P^(N)(η))²}. Equation (8) and the last inequality then yield the expression for the variance, in which β_{ν,N} is as given in the theorem for N sufficiently large.
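A minimal numerical sketch of the PCA normalization and of the modified kernel-density estimate, assuming a Silverman-type bandwidth s = {4/(N(2+ν))}^{1/(ν+4)} and the modified bandwidth ŝ = s/√(s² + (N−1)/N); the correlated data are an arbitrary stand-in. By construction, the normalized dataset is centered with identity empirical covariance, and p^(N)_H is a Gaussian mixture centered at the shrunken points (ŝ/s) η^j_d.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, nu = 200, 6, 4
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))   # correlated stand-in data

# PCA de-correlation and normalization
xbar = X.mean(axis=0)
C = np.cov(X.T)                      # empirical covariance, 1/(N-1) convention
mu, Phi = np.linalg.eigh(C)
mu, Phi = mu[::-1], Phi[:, ::-1]     # eigenvalues in decreasing order
eta_d = (X - xbar) @ Phi[:, :nu] / np.sqrt(mu[:nu])     # eta^j_d, shape (N, nu)

# modified Gaussian kernel-density estimate with Silverman-type bandwidth
s = (4.0 / (N * (2.0 + nu))) ** (1.0 / (nu + 4))
s_hat = s / np.sqrt(s**2 + (N - 1) / N)

def p_N(eta):
    # Gaussian mixture centered at the shrunken points (s_hat/s) eta^j_d
    d2 = np.sum(((s_hat / s) * eta_d - eta) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * s_hat**2))) / (np.sqrt(2 * np.pi) * s_hat) ** nu
```

The shrinking factor ŝ/s is what restores the zero mean and identity second moment of the estimate, which an unmodified kernel estimate would inflate.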

Definition of the random matrix [H N ] and its pdf
The introduction of a (ν×N) random matrix [H N ] will allow the initial dataset to be represented using the diffusion-maps basis.
Let d[η] = ⊗^N_{ℓ=1} dη^ℓ be the measure on M_{ν,N} induced by the Lebesgue measures dη^1, ..., dη^N on R^ν. Let [η_d] ∈ M_{ν,N} be the matrix constructed using the N points η^j_d ∈ R^ν defined by equation (4),

[η_d] = [η^1_d ... η^N_d]. (11)

Finally, for all j = (j_1, ..., j_N) in J = {1, ..., N}^N, we will use the following notation,

[η_d(j)] = [η^{j_1}_d ... η^{j_N}_d] ∈ M_{ν,N}. (12)

Note that matrix [η_d] defined by equation (11) has to be carefully distinguished from matrix [η_d(j)] defined by equation (12). Nevertheless, it can be seen that [η_d] = [η_d(j)] for j = (1, 2, ..., N). Let p^(N)_H be defined by equations (7) and (8). We then define the random matrix [H_N] = [H^1 ... H^N]. Note that in Definition 2, H^1, ..., H^N are not taken as N independent copies of H, whose pdf p_H is unknown, but are taken as N independent copies of H^(N), whose pdf p^(N)_H is known.

PROOF. Using Definition 2 yields p_{[H_N]}([η]) = Π^N_{ℓ=1} p^(N)_H(η^ℓ), and using equation (7) yields equation (13).

Construction of a reduced-order diffusion-maps basis
To identify the subset around which the initial data are concentrated, the PLoM method [1,2] relies on the diffusion-maps method [20,21,22,23]. We use the Gaussian kernel such that, for all η and η' in R^ν, k_{ε_DM}(η, η') = exp{−(1/(4 ε_DM)) ‖η − η'‖²}, in which ε_DM > 0 is a smoothing parameter. Let [K] ∈ M_N be such that [K]_{ij} = k_{ε_DM}(η^i_d, η^j_d) and let [b] be the positive-definite diagonal matrix such that [b]_{ii} = Σ^N_{j=1} [K]_{ij}. Then [P] = [b]^{−1} [K] is the transition matrix of a Markov chain that yields the probability of transition in one step.

Diffusion-maps basis as a non orthogonal vector basis in R N
The eigenvalues λ_1, ..., λ_N and the associated eigenvectors ψ^1, ..., ψ^N of the right-eigenvalue problem [P] ψ^α = λ_α ψ^α are such that 1 = λ_1 > λ_2 ≥ ... ≥ λ_N and can be computed by solving the generalized eigenvalue problem [K] ψ^α = λ_α [b] ψ^α. The eigenvector ψ^1 associated with λ_1 = 1 is a constant vector. For a given integer m with 2 < m ≤ N, we define the reduced-order diffusion-maps basis of order m as the family {g^1, ..., g^m}, with g^α = λ_α^κ ψ^α, that we represent by the matrix [g_m] = [g^1 ... g^m] ∈ M_{N,m}. Note that {g^α}_α is not orthogonal for the inner product <·,·>, but is orthogonal for the inner product defined by [b]. It can also be seen that the construction of the reduced-order diffusion-maps basis [g_m] depends, a priori, on three parameters: the smoothing parameter ε_DM, the order m, and the integer κ.
Nevertheless, we will see in (5.4) that κ does not play a role from a theoretical point of view in the proposed method, contrary to the role it plays in [20]. In the PLoM, its role is that of an additional scaling; its value can therefore be fixed arbitrarily (for instance, it can be set to 1 or even to 0; in the latter case, we have g^α = ψ^α). As a result, the only two parameters that will be considered are ε_DM and m.
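The diffusion-maps construction above can be sketched numerically as follows; the dataset is a synthetic stand-in, and the right eigenvectors of [P] are obtained through the equivalent symmetric eigenproblem, a standard implementation device.

```python
import numpy as np

rng = np.random.default_rng(1)
N, nu = 80, 3
eta_d = rng.normal(size=(N, nu))     # stand-in for the normalized initial dataset
eps_DM, kappa, m = 4.0, 1, 6         # smoothing parameter, scaling exponent, order

# Gaussian kernel matrix [K], normalization [b], transition matrix [P] = [b]^{-1}[K]
D2 = np.sum((eta_d[:, None, :] - eta_d[None, :, :]) ** 2, axis=-1)
K = np.exp(-D2 / (4.0 * eps_DM))
b = K.sum(axis=1)
P = K / b[:, None]

# right eigenvectors of [P] via the symmetric problem
# [b]^{-1/2}[K][b]^{-1/2} phi = lam phi, with psi = [b]^{-1/2} phi
S = K / np.sqrt(np.outer(b, b))
lam, phi = np.linalg.eigh(S)
lam, phi = lam[::-1], phi[:, ::-1]          # descending: 1 = lam_1 > lam_2 >= ...
psi = phi / np.sqrt(b)[:, None]

g = (lam[:m] ** kappa) * psi[:, :m]         # reduced-order basis [g_m] in M_{N,m}
```

With this normalization, ψ^1 is constant and λ_1 = 1, and the columns of ψ are orthonormal for the inner product weighted by [b], as stated in the text.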

Estimation of the optimal values
Under Hypothesis 1, an algorithm associated with the given initial dataset [η_d] has been proposed in [2] for estimating the optimal value ε_opt of ε_DM and an optimal value m_opt of order m. Most of the time, ε_opt and m_opt can be estimated as follows. Let ε_DM → m(ε_DM) be the integer-valued function constructed in [2] from the decay of the eigenvalues λ_α(ε_DM). If m is a decreasing function of ε_DM in the broad sense (if not, the general method based on Information Theory proposed in [2] should be used), then the optimal value ε_opt of ε_DM can be chosen as the smallest value of ε_DM for which m(ε_DM) reaches its smallest value, conditions written as equations (14) and (15). The corresponding optimal value m_opt of m is then given by m_opt = m(ε_opt) and, for such an optimal choice, we have seen through numerical experiments that m_opt ≪ N. Note that such an algorithm has been used with success for many databases in engineering sciences (see [6,9,10,24,7,8,11,4,5]).
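A possible numerical strategy can be sketched as follows; the eigenvalue-decay criterion and the threshold tol below are illustrative stand-ins, not the algorithm of [2], and the dataset is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
eta_d = rng.normal(size=(100, 5))   # stand-in for the normalized initial dataset

def m_of_eps(eta_d, eps_DM, tol=0.1):
    # Hypothetical decay criterion (illustrative stand-in for the algorithm of [2]):
    # count the eigenvalues lam_alpha, alpha >= 2, still larger than tol * lam_2,
    # and add one for the constant eigenvector psi^1.
    D2 = np.sum((eta_d[:, None, :] - eta_d[None, :, :]) ** 2, axis=-1)
    K = np.exp(-D2 / (4.0 * eps_DM))
    b = K.sum(axis=1)
    lam = np.sort(np.linalg.eigvalsh(K / np.sqrt(np.outer(b, b))))[::-1]
    return 1 + int(np.sum(lam[1:] / lam[1] > tol))

# scan a grid of eps_DM values and inspect the behavior of m(eps_DM)
eps_grid = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
m_vals = [m_of_eps(eta_d, e) for e in eps_grid]
```

One would then retain the smallest ε_DM at which m(ε_DM) stabilizes, in the spirit of the selection rule described above.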

On the relationship between hyperparameter ε_DM and the modified Silverman bandwidth ŝ

The invariant measure associated with transition matrix [P] of the one-step Markov chain is the discrete probability measure p_{ε_DM}(i) = [b]_{ii} / Σ^N_{j=1} [b]_{jj}, while p^(N)_H(η) is defined by equations (7) and (8), which is written, for N sufficiently large (that is to say, for ŝ/s ∼ 1 and ŝ ∼ s), as a Gaussian mixture at scale s. In general, for ν sufficiently large (for instance, ν ∼ 10), the optimal value ε_opt defined by equations (14) and (15) is such that ε_opt ≫ 1 while, since ν ≤ N, equation (8) shows that ŝ²/2 < 1. Therefore, p_{ε_DM}(i) is very different from the probability measure p^(N)_H(η) dη, which corresponds to an observation of the initial dataset from inside it, that is to say, an observation at the smallest scale. In contrast, the probability measure p_{ε_DM}(i) is the one for which the initial dataset is observed from outside it, that is to say, at a larger scale.

Properties of the reduced-order diffusion-maps basis
It should be noted that, as announced at the end of (5.1), matrix [G_m] is independent of λ^κ_1, ..., λ^κ_m and thus is independent of κ.
PROOF. The proof is left to the reader.

Probabilistic learning on manifolds (PLoM): construction of the probability measure and its generator
The three main steps of the PLoM introduced in [1] are the following. 1) Construction of an MCMC generator for random matrix [H_N] defined in Definition 2, based on a nonlinear Itô stochastic differential equation (ISDE) that will be introduced below. This MCMC generator is adapted to perform its projection on the subspace spanned by the reduced-order diffusion-maps basis and, in addition, a dissipative term allows the transient part of the response to be rapidly killed. This MCMC generator belongs to the class of Hamiltonian Monte Carlo methods [26,27], which is an MCMC algorithm [28,29,30]. For k = 1, ..., ν and ℓ = 1, ..., N, the representation on the reduced-order diffusion-maps basis is written as H^ℓ_k = Σ^m_{α=1} [Z_m]_{kα} g^α_ℓ. Let L^0(Θ, R^N) be the vector space of all the random variables defined on (Θ, T, P) with values in R^N. It can be seen that each R^N-valued random variable H_k belongs to the subspace L^0(Θ, E_m) ⊂ L^0(Θ, R^N), in which E_m ⊂ R^N is the subspace of R^N spanned by {g^1, ..., g^m}. Note that, contrary to the PCA, which is a reduction along the physical coordinate axis, the representation constructed with the reduced-order diffusion-maps basis is a reduction along the data axis.
The positive parameter c_{νm} is the constant of normalization, [z] = [z^1 ... z^m] ∈ M_{ν,m} with z^α = (z^α_1, ..., z^α_ν) ∈ R^ν and with z^α_k = [z]_{kα}, and g^α_ℓ is given by the reduced-order diffusion-maps basis. (i) Proof of equation (19). For m fixed, the reduced representation of random matrix [H_N] (for which its pdf p_{[H_N]} is given by equation (13)) yields the pdf of equation (19), which is the invariant measure of equations (17a) and (17b). For proving that, there are several possibilities. We choose to use an algebraic-based demonstration, which allows for introducing notations that will be reused in Proposition 2. For simplifying the writing, the Itô equations (17a)-(17b) are rewritten as a second-order stochastic differential equation that has to be read as an equality of generalized stochastic processes (see for instance Chapter XI of [32]), in which D_r[W] is the generalized normalized Gaussian white noise resulting from the generalized derivative with respect to r of the M_{ν,N}-valued Wiener stochastic process defined in Notation 1. For k = 1, ..., ν, for α = 1, ..., m, and for ℓ = 1, ..., N, we define z^α = (z^α_1, ..., z^α_ν) ∈ R^ν and ẑ_k = (ẑ^1_k, ..., ẑ^m_k) ∈ R^m with z^α_k = ẑ^α_k = [z]_{kα}. Similarly, we define the corresponding real functions of (z^1, ..., z^m). Using the mathematical results given in Chapter XIII of [31], it can be deduced that the ISDE corresponding to the previous ν coupled generalized stochastic equations admits a unique invariant measure on (Π^ν_{k=1} R^m) × (Π^ν_{k=1} R^m), defined by the following density with respect to (⊗^ν_{k=1} dẑ_k) ⊗ (⊗^ν_{k=1} dŷ_k),

p(ẑ_1, ..., ẑ_ν; ŷ_1, ..., ŷ_ν) = c²_{νm} exp{ −(1/2) Σ^ν_{k=1} < [g_m]^T [g_m] ŷ_k, ŷ_k > − Φ(ẑ_1, ..., ẑ_ν) }.

Consequently, the joint pdf of the R^ν-valued random variables Z^1, ..., Z^m with respect to ⊗^m_{α=1} dz^α is given, using the third equation (21), by p_{Z^1,...,Z^m}(z^1, ..., z^m) = c_{νm} exp{−Φ(z^1, ..., z^m)} and thus, using equation (21), the pdf of random matrix [Z_m] is obtained. Using the expression of V(u^ℓ) defined in Theorem 2 and introducing ĉ_{νm} = c_{νm}/N^N, this pdf can be rewritten as equation (19). (iii) Proof of uniqueness of an asymptotic stationary and ergodic solution of equations (17a) to (17c). The use of Theorem 9 on page 216 of [31] yields the proof that equations (17a) to (17c) have a unique solution {([Z(r)], [Y(r)]), r ≥ 0} that is a second-order diffusion stochastic process, which is asymptotic, for r → +∞, to a unique stationary stochastic process ([Z_st], [Y_st]) having the properties given in Theorem 3. The ergodicity of the stationary solution is directly deduced from [33] or from [34].
in which, for all j in J (see Definition 1), [η_d(j)] ∈ M_{ν,N} is defined by equation (12). For all k in {1, ..., ν}, p_{Ẑ_k}(·; j) is a Gaussian pdf on R^m. Combining the two previous equations allows p_{[Z_m]}([z]) to be rewritten; equation (22) can then be deduced, in which p_j(m) is given by equation (23a). (ii) From Lemma 1-(v), and using equation (24), it can be seen that a_j(m) ≥ 0. The end of the proof is straightforward.

Remark 3 (About the algebraic representation of pdf p [Zm] and its generator).
(i) Equation (22) shows that pdf p_{[Z_m]} on M_{ν,m} is a linear combination of N^N products of ν Gaussian pdfs on R^N. Consequently, the use of the reduced-order ISDE given by Theorem 3 effectively allows realizations of random matrix [Z_m] to be generated, while a Gaussian generator that would be based on the representation given by equation (22) is unthinkable. (ii) The generation of n_MC ≫ 1 independent realizations {[z_ℓ], ℓ = 1, ..., n_MC} of random matrix [Z_m] is performed by using the MCMC generator defined by Theorem 3, in which the reduced-order stochastic equations (17a) to (17c) are solved using the Störmer-Verlet scheme [35,36], which is well adapted to stochastic Hamiltonian dynamical systems and which is detailed in [1]. We can then deduce the learned dataset.

Definition 6 (Square of the relative distance d²_N(m)). For m fixed, the square of the relative distance of random matrix [H^N_m] to the initial dataset [η_d] is defined by

d²_N(m) = E{ ‖[H^N_m] − [η_d]‖² } / ‖η_d‖².

The following Lemma gives the value of d²_N(N), which corresponds to the value of the distance if the PLoM method is not used (m = N). In this case, the MCMC generator of random matrix [H^N_N] is given by Theorem 2. PROOF. The result is obtained using (i), (ii), and equation (6).
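The Störmer-Verlet integration of a dissipative Hamiltonian ISDE can be illustrated on a scalar stand-in potential V(z) = z²/2, so that the invariant marginal exp(−V) is the standard Gaussian. The update below is a common discretization of dZ = Y dr, dY = −∇V(Z) dr − (f_0/2) Y dr + √f_0 dW; it is only a sketch of the scheme detailed in [1], not the full PLoM generator, and the values of f_0 and the step dr are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

grad_V = lambda z: z     # stand-in potential V(z) = z^2/2, so exp(-V) is N(0, 1)

f0, dr = 1.5, 0.1        # damping parameter and sampling step (arbitrary values)
b = f0 * dr / 4.0
z, y, samples = 0.0, 0.0, []
for k in range(100_000):
    z_half = z + 0.5 * dr * y                      # half-step on position
    dW = np.sqrt(dr) * rng.standard_normal()       # Wiener increment
    y = ((1.0 - b) * y - dr * grad_V(z_half) + np.sqrt(f0) * dW) / (1.0 + b)
    z = z_half + 0.5 * dr * y                      # second half-step on position
    if k > 1_000:                                  # discard the dissipated transient
        samples.append(z)
samples = np.array(samples)
```

The dissipative term with parameter f_0 rapidly kills the transient, after which the retained samples follow the invariant measure up to the discretization bias of the scheme.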
In order to apply Proposition 3 to the case m = N, we need the results given in the following lemma. Using equations (24) and (27b), we have
In Hypothesis 1, based on the properties of {λ α (ε DM )} α , we have introduced the existence of optimal values ε opt and m opt of ε DM and m, respectively. In the following hypothesis, we introduce the connection between m → ε d (m) and {λ α (ε DM )} α .
Hypothesis 2 (Relative to m → ε_d(m)). Under Hypothesis 1, it is assumed that m_opt is such that ε_d(m_opt) is small.

Remark 4 (Comments about Hypotheses 1 and 2 related to m_opt). Regarding Hypothesis 1, devoted to the existence of the optimal values ε_opt of ε_DM and m_opt of m: ∀j ∈ J, p_j(m) is defined by equation (23a) as a function of γ_j(m) (defined by equation (23b)), which can be rewritten in terms of [η^m_d(j)] ∈ M_{ν,N}, with [η_d(j)] ∈ M_{ν,N} defined by equation (12). Using equation (24), it can be seen that
Lemma 5 (Rewriting of function h_d). For all m such that 1 ≤ m ≤ N and for all j in J, let g_j(m), g(m), γ(m), and r(m) be defined as follows, in which γ_j(m) is given by equation (33) and p_j(m) is defined by equation (23a). Using these definitions, we have the rewriting in which ε_d(m) is defined by equation (30), and γ(m) > 0, γ(N) = 1, r(m) > 0, r(N) = 1.
PROOF. Using equation (35), which can be rewritten using (28b) and Lemma 1, and substituting ‖η^m_d‖² given by equation (31) into the expression of g(m), equation (28a) yields equation (37). Due to equation (35), g(m) > 0 and, since ε_d(N) = 0 (see Lemma 4-(i)), the first equation (37) with m = N yields the third equation (37). It then follows that function m → f_d(m) has a unique local minimum that is a global minimum, which is reached for m = m_opt, and equation (40) shows that ∆^−_m < 0. (iii) Since ∆^−_m < 0 for 1 ≤ m ≤ m_opt and ∆^+_m > 0 for m_opt ≤ m ≤ N, equation (41) follows.
in which d^{2,sup}_N(m) is written as in equation (42b). Equations (42a) and (42b) show that, if the hypothesis defined by equation (42a) holds, then the PLoM method is better than the usual one corresponding to d²_N(N).
PROOF. Equations (32a) and (39) yield the result.

Remark 5. The quantity r(m) defined by equation (36) does not seem to be calculable either explicitly or numerically (since there are N^N elements in set J). This is the reason why we have introduced the hypothesis defined by equation (42a) in order to formulate the theorem. This hypothesis has been verified numerically by a direct Monte Carlo simulation of d²_N(m) given in Definition 6, using (6.3) and Remark 3. In (8), we give additional developments and comments about this hypothesis.
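Such a Monte Carlo verification can be sketched for the case m = N, where no reduction is applied: sampling the columns of [H_N] independently from the kernel-density estimate and averaging ‖[H_N] − [η_d]‖²/‖η_d‖² over realizations should give a value close to 2, illustrating the loss of concentration. The dataset below is a synthetic stand-in, normalized so that its empirical mean is zero and its empirical covariance is the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
N, nu = 200, 6

# stand-in normalized dataset: centered, identity empirical covariance
A = rng.normal(size=(N, nu))
A -= A.mean(axis=0)
L = np.linalg.cholesky(np.cov(A.T))
eta_d = A @ np.linalg.inv(L).T          # (N, nu); mean 0, covariance I

s = (4.0 / (N * (2.0 + nu))) ** (1.0 / (nu + 4))
sh = s / np.sqrt(s**2 + (N - 1) / N)

def sample_H():
    # one realization of [H_N]: each column is an independent draw from p^(N)_H
    j = rng.integers(N, size=N)
    return (sh / s) * eta_d[j] + sh * rng.normal(size=(N, nu))

norm2_eta = np.sum(eta_d**2)            # equals nu*(N-1) by construction
n_s = 200
d2_N = np.mean([np.sum((sample_H() - eta_d)**2) for _ in range(n_s)]) / norm2_eta
# d2_N should be close to (2N-1)/(N-1), i.e. approximately 2 (loss of concentration)
```

For m < N, the same experiment would project each realization on the subspace spanned by the reduced-order diffusion-maps basis before computing the distance, which is what Definition 6 measures.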

Justification of the hypothesis introduced in Theorem 4
As explained in Remark 5, r(m) defined by equation (36) cannot explicitly be calculated for 1 ≤ m ≤ N − 1 (for m = N, we have r(N) = 1). In this section, we give a preliminary remark showing the difficulty. We then propose an estimation of r(m) using the maximum entropy principle from Information Theory, and finally, we propose a rough approximation of r(m). In (9), devoted to a numerical illustration, we will compare these last two estimates of d²_N(m) with the "true" function d²_N(m) estimated as explained in Remark 5.
Remark 6 (Preliminary remark). For all j in J and for all m in {1, ..., N}, let a_j(m) ≥ 0 be defined in Proposition 2-(ii). Therefore, γ_j(m), which is defined by equation (23b), can be rewritten as γ_j(m) = exp{−(1/(2s²)) a_j(m)} and consequently, p_j(m), defined by equation (23a), can also be rewritten as in equation (46). The discrete random variable A(m) with values in {a_j(m), j ∈ J}, whose probability distribution {p_j(m), j ∈ J} is defined by equation (46), follows a Maxwell-Boltzmann distribution. Its mean value is ā(m) = E{A(m)} = Σ_{j∈J} a_j(m) p_j(m), and its entropy is written as S({p_j(m)}_j) = −Σ_{j∈J} p_j(m) log p_j(m) = (1/(2s²)) ā(m) + log(Σ_{j∈J} exp{−(1/(2s²)) a_j(m)}). From equation (36), we then have to calculate, for all m ∈ {1, ..., N − 1}, r(m) = Σ_{j∈J} (g_j(m)/g(m)) p_j(m), in which g_j(m) is defined by equation (35). As we have explained, such a calculation can be performed neither explicitly nor numerically (there are N^N elements in J).
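The Maxwell-Boltzmann structure of {p_j(m)}_j and the entropy identity above can be checked numerically on a small toy set of values a_j; the set J itself, with its N^N elements, cannot be enumerated, so the values below are arbitrary stand-ins, as is the value playing the role of s².

```python
import numpy as np

rng = np.random.default_rng(5)
s2 = 0.04                               # plays the role of s^2 (arbitrary value)
a = rng.uniform(0.0, 1.0, size=50)      # toy stand-in for {a_j(m)}

w = np.exp(-a / (2.0 * s2))
p = w / w.sum()                          # Maxwell-Boltzmann probabilities p_j(m)

a_mean = np.sum(a * p)                   # mean value a(m) = E{A(m)}
S = -np.sum(p * np.log(p))               # entropy of the distribution

# identity: S = a(m)/(2 s^2) + log(sum_j exp(-a_j/(2 s^2)))
check = a_mean / (2.0 * s2) + np.log(w.sum())
```

The identity holds exactly because log p_j = −a_j/(2s²) − log Σ_k exp(−a_k/(2s²)), so the entropy splits into the mean energy term and the log partition function.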
In the following, we construct an estimation of r(m) using the maximum entropy principle.

Definition 8 (Discrete random matrix [A]). Let [A] be the discrete random variable with values in {[η_d(j)], j ∈ J}, in which [η_d(j)] ∈ M_{ν,N} is defined by equation (12), and for which the probability distribution is {p̂_j, j ∈ J} with p̂_j = 1/N^N.

Lemma 7 (Second-order moments of random matrix [A]).
Using equations (28a) and (24), equation (39), in which r(m) is defined by equation (36), can be rewritten. PROOF. For all j in J, g_j(m) (defined by equation (35)) can be rewritten, using the proof of Proposition 4 and equation (27b). Substituting this expression of g_j(m) into equation (36) and using equation (35) yield equation (49a), in which γ(m) is the corresponding weighted sum over J. Below, an approximation [A_c] of random matrix [A] is constructed using the maximum entropy principle [37,38,39,40,41]. PROOF. The proof is left to the reader (see for instance [41]). The entropy for the approximating measure is larger than the one for the probability measure P_{[A]}. PROOF. Let f be a mapping on M_{ν,N} such that the corresponding quantity L_f(m) is defined. Using equation (51), it can be seen how L_f(m) is expressed. The third equation (54) is then directly deduced.

Numerical illustration
The numerical illustration proposed is the application (AP1) in Section 10 of reference [5]. For reasons of limitation of the paper length, we cannot reproduce the description of this application and we refer the reader to the given reference. With respect to the notation introduced in (1.1), we have n_w = 20, n_q = 200, n = 220, and N = 200. For the PCA (see (2)) and for ε = 10^{−6} in the second equation (3), we have ν = 9. Consequently, err_PCA(ν) ≤ 10^{−6}. Concerning the nonparametric estimate (see (3)), the use of equations (14) and (15) yields ε_opt = 60 and m_opt = 10. Parameter κ has been fixed to 1. The graph of function α → log(λ_α(ε_opt)) (see (5.1)) is displayed in Figure 2 (left) and the graph of function m → ε_d(m) defined by equation (30) is shown in Figure 2 (right). In order to better visualize these graphs, a zoom has been done on the abscissa (α ≤ 50 and m ≤ 50 instead of the upper bound N = 200). It can be seen that these graphs are similar to the ones shown in Figure 1 and that Hypotheses 1 and 2 are well verified. The graph of function m → f_d(m) defined by equation (32b) is displayed in Figure 3 (left) and the graph of function m → g(m) defined by equation (37) in Figure 3 (right). It has been verified that the L²-convergence is obtained for the value of n_MC used. Figure 4 (left) shows the graph of function m → d^{2,sim}_N(m). It can be seen that the local minimum is a global minimum obtained for m = m_opt, as expected, and that d^{2,sim}_N(N) ≃ 2 (in agreement with Lemma 2). Figure 4 (right) shows three curves: again the graph of function m → d^{2,sim}_N(m), together with the two estimates of d²_N(m) introduced in (8).

Conclusions
In this paper, we have presented mathematical results that justify, highlight, and better explain the probabilistic learning on manifolds proposed in [1]. We have formulated and proven several results, which show that the PLoM methodology is efficient for probabilistic learning, as has been demonstrated in the framework of applications performed for complex engineering systems. The distance introduced for the mathematical analysis of the concentration properties of the probability measure could be used to estimate the optimal dimension of the reduced-order diffusion-maps basis and thus to replace the algorithm previously introduced, which uses only the initial dataset. However, the criterion based on this distance would require generating a large number of replicates of the learned dataset and therefore would induce a higher numerical cost.