Optimality Conditions in Variational Form for Non-Linear Constrained Stochastic Control Problems

Optimality conditions in the form of a variational inequality are proved for a class of constrained optimal control problems of stochastic differential equations. The cost function and the inequality constraints are functions of the probability distribution of the state variable at the final time. The analysis uses in an essential manner a convexity property of the set of reachable probability distributions. An augmented Lagrangian method based on the obtained optimality conditions is proposed and analyzed for iteratively solving the problem. At each iteration of the method, a standard stochastic optimal control problem is solved by dynamic programming. Two academic examples are investigated.


1. Introduction. This article is devoted to the derivation of optimality conditions for a class of control problems of stochastic differential equations (SDEs). For a given adapted control process u, let (X_s^{0,Y_0,u})_{s∈[0,T]} denote the solution to a controlled SDE with initial condition Y_0, and let m_T^{0,Y_0,u} denote the probability distribution of X_T^{0,Y_0,u}. The problem under study is

    inf_{u ∈ U_0(Y_0)} F(m_T^{0,Y_0,u}), subject to: G(m_T^{0,Y_0,u}) ≤ 0,   (1)

where the mappings F and G are given and satisfy differentiability assumptions. The control set U_0(Y_0) is a set of adapted stochastic processes. In the article, n denotes the dimension of the state variable and N the number of constraints. A precise description of problem (1) will be given in Section 2. In this paper, we call the mappings F and G linear if they can be written in the form F(m) = ∫_{R^n} f(x) dm(x) and G(m) = ∫_{R^n} g(x) dm(x), where f : R^n → R and g : R^n → R^N. In this specific case, problem (1) is equivalent to the following stochastic optimal control problem with an expectation constraint:

    inf_{u ∈ U_0(Y_0)} E[f(X_T^{0,Y_0,u})], subject to: E[g(X_T^{0,Y_0,u})] ≤ 0.   (2)

The terminology non-linear used in the title refers to the fact that the functions F and G for which our result applies are not necessarily linear. In other words, the cost function and the constraints are not necessarily formulated as expectations of functions of X_T^{0,Y_0,u}. This is the specificity of the present article. Stochastic optimal control problems with a non-linear cost function are mainly motivated by applications in economics and finance. In many situations, minimizing the expectation of a random cost may be unsatisfactory and one may prefer to take into account the "risk" induced by the dispersion of the cost. In the literature, there exist many different models of risk and some of them can be formulated as functions of the probability distribution of the state variable, as explained for example in [31, Chapter 6]. Some portfolio problems with risk-averse cost functions are studied in [26]. In [3, Section 5], a mean-variance portfolio selection problem is considered, as well as in [22].
In [19], a gradient-based method is developed for minimization problems of the conditional value at risk, a popular risk-averse cost function. The risk can also be taken into account by considering a constraint of the form G(m_T^{0,Y_0,u}) ≤ 0. For example, one can try to keep the probability of bankruptcy under a given threshold with a probability constraint. If G models the variance of the outcome of some industrial process, then a constraint on the variance can guarantee some uniformity in the outcome, which can be a desirable property. Final-time constraints have several applications in finance; see for example [33], where probability constraints are used for solving a super-replication problem, or [32], where the final probability distribution is fixed in order to determine no-arbitrage bounds of exotic options. Finally, let us mention that a problem of the form (1) can be seen as a simple model for an optimization problem of a multi-agent system, where a government, that is to say a centralized controller, influences a (very large) population of agents whose behavior is described by an SDE (see for example [2] and the references therein on this topic). Let us mention however that for general multi-agent models, the coefficients of the SDE at time t depend on the current probability distribution m_t^{0,Y_0,u}. Let us describe the available results in the literature related to optimality conditions for problems similar to problem (1). Problem (1), without constraints, is a specific case of an optimal control problem of a McKean-Vlasov process (or mean-field-type control problem). For this more general class of problems, the drift and volatility of the SDE possibly depend at any time t on the current distribution m_t^{0,Y_0,u}.
Optimality conditions usually take the form of a stochastic maximum principle: for a solution ū, the control ū_t minimizes almost surely and for almost every time t a Hamiltonian involving a costate which is obtained as a solution to a backward stochastic differential equation (see for example [3], [4], [9], [13]). In a different but related approach, one can consider control processes in feedback form, that is to say of the form u_t = u(t, X_t^{0,Y_0,u}), where the mapping u : [0, T] × R^n → U has to be optimized. Under regularity assumptions, the probability distribution m_t^{0,Y_0,u} has a density, say μ(t, x), which is the solution to the Fokker-Planck equation:

    ∂_t μ(t, x) + ∇·( b(x, u(t, x)) μ(t, x) ) − (1/2) ∇²:( a(x, u(t, x)) μ(t, x) ) = 0,

where a = σσ^T and where the operators ∇· and ∇²: are defined by

    ∇·f(x) = Σ_{i=1}^n ∂_{x_i} f_i(x) and ∇²:g(x) = Σ_{i,j=1}^n ∂²_{x_i x_j} g_{ij}(x),

respectively. A derivation of the Fokker-Planck equation can be found in [12, Lemma 3.3], for example. Note that if b and σ do not depend on μ, then the Fokker-Planck equation is a linear partial differential equation. In this approach, the problem is an optimal control problem of the Fokker-Planck equation and optimality conditions take the form of a Pontryagin maximum principle. The adjoint equation is in this setting an HJB equation. We refer the reader to [1], [2], [4], and [15] for this approach.
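The correspondence between the SDE and the Fokker-Planck equation can be illustrated numerically; the following sketch (a hypothetical, uncontrolled example with b(x) = −x and σ = 1, chosen because the Fokker-Planck solution is an explicit Gaussian) checks that an Euler-Maruyama simulation reproduces the variance propagated by the Fokker-Planck equation.

```python
import numpy as np

# Monte Carlo check of the SDE / Fokker-Planck correspondence on the
# hypothetical uncontrolled dynamics dX_s = -X_s ds + dW_s, X_0 = 0: the
# Fokker-Planck equation propagates the Gaussian density N(0, v(t)) with
# v(t) = (1 - exp(-2t)) / 2, which the Euler scheme should reproduce.
rng = np.random.default_rng(3)
T, n_steps, n_paths = 1.0, 200, 100_000
dt = T / n_steps
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += -X * dt + np.sqrt(dt) * rng.normal(size=n_paths)
v_exact = (1.0 - np.exp(-2.0 * T)) / 2.0
print(X.var(), v_exact)   # the two variances should be close
```

The agreement degrades as the time step grows, since the Euler scheme only approximates the law of the diffusion at order one in dt.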
In this article, optimal control problems of the form

    inf_{u ∈ U_0(Y_0)} E[ φ(X_T^{0,Y_0,u}) ]   (3)

are called standard problems, considering the fact that they have been extensively studied in the last decades. They can be solved by dynamic programming, by computing the solution to the associated Hamilton-Jacobi-Bellman equation (see the textbooks [16], [28], [35] on this field). Since standard problems are of form (2) (without constraints), they fall into the general class of problems investigated in the paper. The optimality conditions provided in the present article can be shortly formulated as follows: if ū is a solution to (1) and satisfies a qualification condition, then it is also the solution to a standard problem of the form (3), where the involved function φ is the derivative at m_T^{0,Y_0,ū} (in a specific sense) of the Lagrangian of the problem, L := F + ⟨λ, G⟩, for some non-negative Lagrange multiplier λ satisfying a complementarity condition. Our optimality conditions therefore take the form of a variational inequality. Our analysis relies on the following technical result: the closure (for the Wasserstein distance associated with the L¹-distance) of the set of reachable probability distributions at time T is convex. This property is proved by constructing controls imitating the behaviour of relaxed controls. To the best of our knowledge, the optimality conditions for problem (1) are new, as is the convexity property satisfied by the reachable set. They differ from the maximum principle mentioned above and are, to a certain extent, related to the optimality conditions obtained for mean-field-type control problems formulated with feedback laws. The presence of non-linear constraints is another novelty of the article; in the literature, most constraints of the form G(m_T^{0,Y_0,u}) ≤ 0 are expectation constraints, see for example [8], [27]. The existence of a Lagrange multiplier, even in a linear setting, is not often considered; see [7, Section 5] or [23, Section 6].
Finally, we mention that we also prove the sufficiency of the optimality conditions in variational form, when F and G are convex.
It is well known that problems of the form (1) are time-inconsistent, i.e. it is (in general) not possible to write a dynamic programming principle by parameterizing problem (1) by its initial time and initial condition, as is customary for standard problems of the form (3). A dynamic programming principle can be written if one considers the whole initial probability distribution as a state variable, see [18], [29], [30]. However, in practice, this approach does not allow, in general, to solve the problem, because the complexity of the method grows exponentially with the dimension of the (discretized) space of probability distributions. The optimality conditions in variational form and the convexity property proved in this article naturally lead to iterative methods for solving problem (1), based on successive resolutions of standard problems and thus overcoming the difficulty related to time-inconsistency. We propose, analyze, and test such a method in the article. The cost function of the standard problem to be solved at each iteration is the derivative, in a certain sense, of an augmented Lagrangian.
We give a precise formulation of the problem under study in Section 2. We also discuss the notion of differentiability which is used. In Section 3, we prove the convexity of the closure of the reachable set of probability distributions. Optimality conditions in variational form are proved in Section 4. The case of convex problems is discussed. Our numerical method for solving the problem is described and analyzed in Section 5. We provide results for two academic examples. Elements on optimal transportation theory are given in the appendix.
2. Formulation of the problem and assumptions.

2.1. Notation.
• The set of probability measures on R^n is denoted by P(R^n). For a function φ : R^n → R, its integral (if well-defined) with respect to the measure m ∈ P(R^n) is denoted by ∫_{R^n} φ(x) dm(x). Given two measures m_1 and m_2 ∈ P(R^n), we denote

    ∫_{R^n} φ(x) d(m_2 − m_1)(x) := ∫_{R^n} φ(x) dm_2(x) − ∫_{R^n} φ(x) dm_1(x).

• For a given random variable X with values in R^n, we denote by σ(X) the σ-algebra generated by X.
• For a given vector x ∈ R^q, we denote by |x| its Euclidean norm and by |x|_∞ its supremum norm.
• For p ≥ 1, we denote by P_p(R^n) the set of probability measures having a finite p-th moment:

    P_p(R^n) := { m ∈ P(R^n) | ∫_{R^n} |x|^p dm(x) < ∞ }.

We recall that for 1 ≤ p ≤ q, the space P_q(R^n) is included into P_p(R^n). We equip P_1(R^n) with the Wasserstein distance d_1. The definition of the Wasserstein distance is recalled in the appendix. In the article, we only make use of the dual representation of d_1 [34, Remark 6.5]: for all m_1, m_2 ∈ P_1(R^n),

    d_1(m_1, m_2) = sup_{φ ∈ 1-Lip(R^n)} ∫_{R^n} φ(x) d(m_2 − m_1)(x),   (4)

where 1-Lip(R^n) := { φ : R^n → R | |φ(y) − φ(x)| ≤ |y − x|, ∀(x, y) ∈ R^n × R^n } is the set of real-valued Lipschitz continuous functions of modulus 1.
• For all R ≥ 0, we define

    B̄_p(R) := { m ∈ P_p(R^n) | ∫_{R^n} |x|^p dm(x) ≤ R }.

• The open (resp. closed) ball in R^n of radius r ≥ 0 and center 0 is denoted by B_r (resp. B̄_r), and its complement by B_r^c (resp. B̄_r^c), for the Euclidean norm.
• For a given p ≥ 1, we say that a function φ : R^n → R^q is dominated by |x|^p if for all ε > 0, there exists r > 0 such that for all x ∈ B_r^c, |φ(x)| ≤ ε|x|^p.
• The convex envelope of a set R is denoted conv(R). When R is a subset of P_1(R^n), its closure for the d_1-distance is denoted cl(R).
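The dual (Kantorovich-Rubinstein) representation of d_1 can be checked numerically in dimension one, where the distance between two empirical measures with equally many atoms is attained by sorting the samples. The sketch below (a hypothetical illustration, not part of the paper's method) also computes the lower bound given by one 1-Lipschitz test function.

```python
import numpy as np

# Sanity check of the dual (Kantorovich-Rubinstein) representation of d_1 on
# 1-D empirical measures: for equally weighted samples of the same size, d_1
# is attained by sorting, and any 1-Lipschitz test function gives a lower bound.
def w1_empirical(x, y):
    """d_1 between the empirical measures of two samples of equal size."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    return np.mean(np.abs(x - y))

rng = np.random.default_rng(0)
m1 = rng.normal(0.0, 1.0, 1000)
m2 = m1 + 0.5                                # translated copy: d_1(m1, m2) = 0.5
phi = lambda t: t                            # a 1-Lipschitz test function
dual_lb = phi(m2).mean() - phi(m1).mean()    # <= d_1, attained for a translation
print(w1_empirical(m1, m2), dual_lb)
```

For a pure translation the linear test function φ(t) = t attains the supremum in the dual representation, so the two printed values coincide.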

2.2. State equation.
We fix a final time T > 0 and a Brownian motion (W_s)_{s∈[0,T]} with values in R^d. For 0 ≤ t ≤ s ≤ T, we denote by F_{t,s} the σ-algebra generated by (W_r − W_t)_{r∈[t,s]}. Let U be a compact subset of R^k. Note that we do not make any other assumption on U: it can be non-convex, for example, or can be a discrete set. For a given random variable Y_t independent of F_{t,T} with values in R^n, we define the set U_t^0 as the set of control processes (u_s)_{s∈[t,T]} taking values in U such that for all s ∈ [t,T], u_s is F_{t,s}-measurable, and we define the set U_t(Y_t) as the set of control processes (u_s)_{s∈[t,T]} taking values in U such that for all s ∈ [t,T], u_s is (σ(Y_t) ∨ F_{t,s})-measurable. The drift b : R^n × U → R^n and the volatility σ : R^n × U → R^{n×d} being given, we consider the state equation

    dX_s^{t,Y_t,u} = b(X_s^{t,Y_t,u}, u_s) ds + σ(X_s^{t,Y_t,u}, u_s) dW_s, for s ∈ [t,T], X_t^{t,Y_t,u} = Y_t.

The well-posedness of this SDE is ensured by Assumption A1 [21, Section 5] below. We also denote by m_s^{t,Y_t,u} the probability distribution of X_s^{t,Y_t,u}. All along the article, we assume that the following assumption holds true. From now on, the initial condition Y_0 and the real number p ≥ 2 introduced below are fixed.
Assumption A1. There exists K > 0 such that for all x, y ∈ R^n and for all u, v ∈ U,

    |b(y, v) − b(x, u)| + |σ(y, v) − σ(x, u)| ≤ K( |y − x| + |v − u| ).

The following lemma is classical, see for example [17, Section 2.5].

Lemma 2.1. There exists a constant R ≥ 0 such that for all u ∈ U_0(Y_0) and for all s ∈ [0,T], E[ |X_s^{0,Y_0,u}|^p ] ≤ R; in particular, m_s^{0,Y_0,u} ∈ B̄_p(R).
By Lemma A.2 (in the appendix), B̄_p(R) is compact for the d_1-distance; in particular, it is bounded. It follows that the set R := { m_T^{0,Y_0,u} | u ∈ U_0(Y_0) } of reachable probability distributions at time T and its closure cl(R) are bounded. We can therefore consider, for future reference, the diameter of cl(R), defined by

    sup_{m_1, m_2 ∈ cl(R)} d_1(m_1, m_2).

2.3. Formulation of the problem and regularity assumptions. Consider two mappings F : P_p(R^n) → R and G : P_p(R^n) → R^N. We aim at studying the following problem:

    inf_{u ∈ U_0(Y_0)} F(m_T^{0,Y_0,u}), subject to: G(m_T^{0,Y_0,u}) ≤ 0.   (P)

Throughout the article, we assume that on top of Assumption A1, Assumptions A2 and A3 below, dealing with the continuity and the differentiability of F and G, are satisfied. The constant R used in these assumptions is the one given by Lemma 2.1.
Assumption A2. The restrictions of F and G to B̄_p(R) are continuous for the d_1-distance.
In order to state optimality conditions, we need a notion of derivative for the mappings F and G. Denoting by M(R^n) the set of finite signed measures on R^n, we define:

    M_p(R^n) := { μ ∈ M(R^n) | ∫_{R^n} |x|^p d|μ|(x) < ∞ and μ(R^n) = 0 }.

Let A : M_p(R^n) → R^q be a linear mapping. We say that the function φ : R^n → R^q is a representative of A if for all m ∈ M_p(R^n), the integral ∫_{R^n} φ(x) dm(x) is well-defined and equal to Am:

    Am = ∫_{R^n} φ(x) dm(x), ∀m ∈ M_p(R^n).

If φ is a representative of A, then for any constant c ∈ R^q, φ + c is also a representative of A, since ∫_{R^n} c dm(x) = 0, ∀m ∈ M_p(R^n).
Conversely, if the value of a representative φ is fixed for a given point x_0 ∈ R^n, then for all x ∈ R^n, the value of φ(x) is determined by

    φ(x) = φ(x_0) + A(δ_x − δ_{x_0}),

where δ_x and δ_{x_0} are the Dirac measures centered at x and x_0, respectively. Therefore, the representative, if it exists, is uniquely defined up to a constant.
2. We say that f is strictly differentiable at 0 if there exists Df(0) ∈ R^{1×2} such that

    f(z_1) − f(z_2) = Df(0)(z_1 − z_2) + o(|z_1 − z_2|), as z_1 → 0 and z_2 → 0.
1. For all m, m_1, and m_2 ∈ B̄_p(R), there exists a linear form DF(m) : M_p(R^n) → R such that the mapping

    (θ_1, θ_2) ∈ R² ↦ F( m + θ_1(m_2 − m) + θ_2(m_1 − m) )

is Gâteaux-differentiable at 0, with derivative (DF(m)(m_2 − m), DF(m)(m_1 − m)). Moreover, DF(m) possesses a continuous representative, dominated by |x|^p, and denoted x ∈ R^n ↦ DF(m, x) ∈ R.
In the article, we make use of the derivative DF(m) (a linear form from M_p(R^n) to R) and its representative. The two notions can be distinguished according to the presence (or not) of the variable x.

2.4. Discussion of the notion of derivative. A general class of cost functions satisfying Assumptions A2 and A3 can be described as follows. Let K ∈ N, let Ψ : R^K → R^q be differentiable, and let φ : R^n → R^K be a continuous function dominated by |x|^p. We then define on P_p(R^n):

    H(m) := Ψ( ∫_{R^n} φ(x) dm(x) ).   (11)

Note that for all control processes u ∈ U_0(Y_0), H(m_T^{0,Y_0,u}) is well-defined, since m_T^{0,Y_0,u} ∈ P_p(R^n). For all R ≥ 0, the continuity of H on B̄_p(R) follows from Lemma A.3 (in the appendix). One can easily check that the mapping H is differentiable in the sense of Assumption A3.1. The representative of its derivative is given by

    DH(m, x) = DΨ( ∫_{R^n} φ(y) dm(y) ) φ(x),

up to a constant. Furthermore, if Ψ is continuously differentiable, then H is differentiable in the sense of Assumption A3.2. Note that the function φ does not need to be differentiable. Let us mention that the variance of a probability distribution m can be described with a function of the form (11), taking (for n = 1) φ(x) = (x, x²) and Ψ(z_1, z_2) = z_2 − z_1². Further examples are discussed in detail in [25, Section 4]. We finish this subsection with two remarks.

Remark 1. The fact that F should be defined on the whole space P_p(R^n) discards cost functions whose formulation is based on the density of the probability measure (since a density does not always exist for probability distributions in P_p(R^n)). For example, a problem consisting in minimizing a distance between PDF(m_T^{0,Y_0,u}) and a given probability density function f_ref, where PDF stands for probability density function, does not fit in the proposed framework.
Remark 2. The notion of derivative provided in [12, Section 6] and the one introduced in Assumption A3 are of a different nature, because they aim at evaluating the variation of functions from B̄_p(R) to R along different kinds of paths. While our derivative is represented by a function from R^n to R, the one of [12] is represented by a function from R^n to R^n (see [12, Theorem 6.5]). This difference of nature can be better understood by considering the mapping m ∈ P_p(R^n) ↦ ∫_{R^n} φ dm. This mapping is a monomial, according to the terminology given in [12, Example, page 43]. Its derivative (in the sense of [12]) is represented by the mapping x ∈ R^n ↦ Dφ(x) ∈ R^n (see [12, Example, page 44]). In the current framework, the derivative of m ∈ P_p(R^n) ↦ ∫_{R^n} φ dm is the real-valued mapping x ∈ R^n ↦ φ(x) ∈ R, up to a constant.
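The variance example of the form (11) and the representative of its derivative can be illustrated on empirical measures; the following sketch (a hypothetical illustration, with the names H, DH chosen here) checks the derivative by a finite difference along the path θ ↦ (1 − θ)m + θδ_x, whose derivative at 0 is DH(m, x) − ∫ DH(m, y) dm(y).

```python
import numpy as np

# The variance as a cost of the form (11): H(m) = Psi(∫ phi dm) with
# phi(x) = (x, x^2), Psi(z1, z2) = z2 - z1^2 (case n = 1), evaluated on
# empirical measures; the derivative representative is
# DH(m, x) = <DPsi(∫ phi dm), phi(x)>.
phi = lambda x: np.array([x, x**2])            # R -> R^2
Psi = lambda z: z[1] - z[0]**2                 # R^2 -> R
DPsi = lambda z: np.array([-2.0*z[0], 1.0])    # gradient of Psi

def H(sample):
    """H evaluated on the empirical measure of `sample`."""
    return Psi(phi(sample).mean(axis=1))

def DH(sample, x):
    """Representative of the derivative of H at the empirical measure."""
    return DPsi(phi(sample).mean(axis=1)) @ phi(x)

rng = np.random.default_rng(1)
s = rng.normal(2.0, 3.0, 200_000)              # variance 9
print(H(s))                                    # close to 9

# finite-difference check along theta -> (1 - theta) m + theta delta_x:
# the derivative at 0 is DH(m, x0) - ∫ DH(m, y) dm(y)
x0, eps = 5.0, 1e-4
z = phi(s).mean(axis=1)
fd = (Psi((1 - eps)*z + eps*phi(x0)) - Psi(z)) / eps
```

As noted above, φ itself need not be differentiable for H to admit such a representative.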

2.5. Existence of a solution.
Observe that problem (P) can be equivalently formulated as a problem over probability measures. The feasible set R_ad of problem (P) is defined by

    R_ad := { m ∈ R | G(m) ≤ 0 },   (13)

where R = { m_T^{0,Y_0,u} | u ∈ U_0(Y_0) } is the set of reachable probability distributions at time T. Consider the two problems

    inf_{m ∈ R_ad} F(m) and inf_{m ∈ cl(R_ad)} F(m).   (P̄)

The problem on the left is equivalent to (P): if ū ∈ U_0(Y_0) is a solution to (P), then m_T^{0,Y_0,ū} is a solution to it and conversely, if m̄ ∈ R_ad is a solution to it, then any ū ∈ U_0(Y_0) such that m̄ = m_T^{0,Y_0,ū} is a solution to (P). The notation (P̄) refers to the problem on the right, obtained by relaxation; by continuity of F for the d_1-distance, the value of (P̄) is less than or equal to the value of (P). Lemma 2.1 enables us to prove the existence of a solution to problem (P̄).

Lemma 2.4. If R_ad is non-empty, then problem (P̄) has a solution.

Proof. It is proved in Lemma A.2 that the set B̄_p(R) is compact for the d_1-distance. By Lemma 2.1, R ⊆ B̄_p(R) and therefore R_ad ⊆ B̄_p(R). Since B̄_p(R) is closed, cl(R_ad) ⊆ B̄_p(R) and therefore cl(R_ad) is compact, since it is a closed subset of a compact set. The existence of a solution follows, since F is continuous.

Although we have an existence result for (P̄), it is in general difficult to prove the existence of a solution to (P).
3. Convexity of the reachable set. This section is dedicated to the proof of the convexity of the closure of R (the set of reachable probability distributions at time T). This result is an important tool for the proof of the optimality conditions in Section 4 and for the numerical method developed in Section 5. Let us explain the underlying idea with a simple example. Consider two processes u_1 and u_2 ∈ U_0(Y_0) and the corresponding final probability distributions m_T^{0,Y_0,u_1} and m_T^{0,Y_0,u_2}. We aim at building a control process u such that

    m_T^{0,Y_0,u} = (1/2) m_T^{0,Y_0,u_1} + (1/2) m_T^{0,Y_0,u_2}.   (14)

A very simple way of building such a control process is to define a random variable S, independent of Y_0 and F_{0,T}, taking two different values with probability 1/2. A control u realizing (14) can then be constructed in U_0(S, Y_0): it suffices that u = u_1 for one value of S and that u = u_2 for the other. The obtained control process can be seen as a relaxed control, since it is measurable with respect to a larger filtration. The main idea of Lemma 3.1 is to construct control processes in U_0(Y_0) imitating the behaviour of the relaxed control process u.
Lemma 3.1. The closure of the set of reachable probability measures for the d_1-distance, denoted cl(R), is convex.
Proof. Our approach mainly consists in proving that

    (1/2) m_T^{0,Y_0,u_1} + (1/2) m_T^{0,Y_0,u_2} ∈ cl(R), for all u_1, u_2 ∈ U_0(Y_0).   (15)

Let u_1 and u_2 in U_0(Y_0). To prove (15), it suffices to prove that there exists a family of control processes (u_ε)_{ε∈(0,T)} in U_0(Y_0) such that

    d_1( m_T^{0,Y_0,u_ε}, (1/2) m_T^{0,Y_0,u_1} + (1/2) m_T^{0,Y_0,u_2} ) → 0, as ε → 0.   (16)

Let 0 < ε < T, let ũ_1 and ũ_2 be two processes in U_ε(Y_0) such that for k = 1 and k = 2, the processes (u_s^k)_{s∈[0,T−ε]} and (ũ_s^k)_{s∈[ε,T]} can be seen as the same measurable function of, respectively, (Y_0, (W_s)_{s∈[0,T−ε]}) and (Y_0, (W_s − W_ε)_{s∈[ε,T]}). In other words, we simply delay the observation of the variation of the Brownian motion by a time ε. Let us denote by A_1 and A_2 two complementary events of probability 1/2, measurable with respect to F_{0,ε}. Fixing u_0 ∈ U, we define u_ε ∈ U_0(Y_0) as follows:

    u_{ε,s} = u_0 for s ∈ [0, ε] and u_{ε,s} = ũ_s^k on A_k, for s ∈ (ε, T], k = 1, 2.

For all φ ∈ 1-Lip(R^n), we have

    ∫_{R^n} φ(x) d( m_T^{0,Y_0,u_ε} − (1/2) m_T^{0,Y_0,u_1} − (1/2) m_T^{0,Y_0,u_2} )(x) = Σ_{k=1,2} (a_k + b_k),

where a_k and b_k are given by

    a_k = E[ 1_{A_k} ( φ(X_T^{0,Y_0,u_ε}) − φ(X_T^{ε,Y_0,ũ_k}) ) ] and b_k = E[ 1_{A_k} φ(X_T^{ε,Y_0,ũ_k}) ] − (1/2) E[ φ(X_T^{0,Y_0,u_k}) ].

Let us first estimate a_k. Using the Lipschitz-continuity of φ, we obtain that

    |a_k| ≤ E[ 1_{A_k} |X_T^{0,Y_0,u_ε} − X_T^{ε,Y_0,ũ_k}| ].   (17)

On A_k, the processes X^{0,Y_0,u_ε} and X^{ε,Y_0,ũ_k} are driven by the same control after time ε, and differ only through their states at time ε. We deduce from the Cauchy-Schwarz inequality and Lemma 2.1 that

    |a_k| ≤ C √ε,   (18)

for a constant C independent of φ and ε. Let us estimate b_k. Since ũ_k and Y_0 are independent of A_k and using the definition of ũ_k, we obtain that

    E[ 1_{A_k} φ(X_T^{ε,Y_0,ũ_k}) ] = (1/2) E[ φ(X_T^{ε,Y_0,ũ_k}) ],

and the random variables X_T^{ε,Y_0,ũ_k} and X_{T−ε}^{0,Y_0,u_k} have the same probability distribution. We obtain with the Lipschitz-continuity of φ, the Cauchy-Schwarz inequality, and Lemma 2.1 that

    |b_k| = (1/2) | E[ φ(X_{T−ε}^{0,Y_0,u_k}) ] − E[ φ(X_T^{0,Y_0,u_k}) ] | ≤ C √ε.   (19)

Combining (17), (18), and (19), we obtain that

    | ∫_{R^n} φ(x) d( m_T^{0,Y_0,u_ε} − (1/2) m_T^{0,Y_0,u_1} − (1/2) m_T^{0,Y_0,u_2} )(x) | ≤ M √ε,

where M is a constant independent of φ and ε. Using the dual representation of d_1 (given by (4)), we deduce that

    d_1( m_T^{0,Y_0,u_ε}, (1/2) m_T^{0,Y_0,u_1} + (1/2) m_T^{0,Y_0,u_2} ) ≤ M √ε.

This proves (16) and thus justifies (15). We can now conclude the proof. It follows from (15) that

    cl(R̃) ⊆ cl(R), where R̃ := { (1/2) m_1 + (1/2) m_2 | m_1, m_2 ∈ R }.   (20)

Since R ⊆ R̃, we have cl(R) ⊆ cl(R̃), and therefore, by (20), cl(R) = cl(R̃). It remains to prove that cl(R̃) is convex, which is an easy task: by (15), the closed set cl(R) = cl(R̃) is stable by taking midpoints, and a closed set which is stable by midpoints is convex.
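The randomization idea behind Lemma 3.1 can be illustrated by a Monte Carlo experiment; the sketch below uses hypothetical dynamics dX_s = u ds + dW_s with two constant controls, and checks that selecting the control through an independent coin produces the half-half mixture of the two final laws.

```python
import numpy as np

# Monte Carlo illustration of the randomization argument of Lemma 3.1, on the
# hypothetical dynamics dX_s = u ds + dW_s, X_0 = 0, with two constant
# controls u1 = -1 and u2 = +1 and T = 1.
rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 50, 100_000
dt = T / n_steps

def simulate(u):
    """Euler scheme for dX = u ds + dW on [0, T]."""
    X = np.zeros(n_paths)
    for _ in range(n_steps):
        X += u * dt + np.sqrt(dt) * rng.normal(size=n_paths)
    return X

# randomized control: a coin S, independent of the Brownian motion, selects
# u1 or u2; the law of the final state is the half-half mixture of the two laws.
S = rng.random(n_paths) < 0.5
X_mix = np.where(S, simulate(-1.0), simulate(+1.0))
# first two moments of the mixture: mean 0 and second moment 2, since
# X_T ~ N(-1, 1) under u1 and X_T ~ N(+1, 1) under u2
print(X_mix.mean(), (X_mix**2).mean())
```

In the lemma, the coin S is itself emulated by an event measurable with respect to the Brownian increments on the initial interval [0, ε], which is what allows the mixing control to remain in U_0(Y_0).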
On the feasibility of the problem. Proving the feasibility (or infeasibility) of the problem is in general a difficult task. Still, Lemma 3.1 can be used to prove the existence of m ∈ cl(R) such that G(m) < 0 (this ensures the feasibility of the problem), or can be used to prove that for all m ∈ cl(R), G(m) > 0 (in this case the problem is infeasible). Given controls u_1, u_2, ..., u_K ∈ U_0(Y_0), we obtain with Lemma 3.1 that the set conv({ m_T^{0,Y_0,u_k} }_{k=1,...,K}) is included in cl(R). Thus, if there exists λ ∈ R^K such that λ ≥ 0, Σ_{k=1}^K λ_k = 1, and

    G( Σ_{k=1}^K λ_k m_T^{0,Y_0,u_k} ) < 0,

then the problem is feasible.
Let us consider now the situation where G is of the form G(m) = Ψ( ∫_{R^n} φ(x) dm(x) ), with φ : R^n → R^K and Ψ : R^K → R^N, as proposed in Subsection 2.4. Let us set

    Z := { ∫_{R^n} φ(x) dm(x) | m ∈ cl(R) } ⊆ R^K.

If there exists z ∈ Z such that Ψ(z) < 0, then the problem is feasible. If for all z ∈ Z, Ψ(z) > 0, then the problem is infeasible. Thus, if we can describe in a convenient way the set Z, then the issue of feasibility is reduced to a feasibility problem of finite dimension and is likely to be easier.
Regarding the description of Z, a dual approach can be considered. The set Z is indeed convex, by Lemma 3.1. It is moreover compact, since cl(R) is compact and the mapping m ↦ ∫_{R^n} φ(x) dm(x) is continuous for the d_1-distance. Given λ ∈ R^K, we define

    Z(λ) := inf_{z ∈ Z} ⟨λ, z⟩ = inf_{m ∈ cl(R)} ∫_{R^n} ⟨λ, φ(x)⟩ dm(x);

thus the quantity Z(λ) can be computed numerically by solving a standard problem. By the Hahn-Banach theorem, we have the following relation:

    Z = ∩_{λ ∈ R^K, |λ|=1} { z ∈ R^K | ⟨λ, z⟩ ≥ Z(λ) }.

Thus an outer approximation of Z can be computed by replacing the sphere of radius 1 (in R^K) by a discrete subset, in the right-hand side of the above relation.
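The outer-approximation idea can be sketched on a finite-dimensional toy example (a hypothetical stand-in: a disc in R², whose support function is explicit; in the actual method each evaluation of Z(λ) would require solving a standard problem).

```python
import numpy as np

# Outer approximation of a convex compact set Z from its support function
# Z(lam) = inf_{z in Z} <lam, z>. Toy stand-in (hypothetical): Z is the disc
# of center c and radius r in R^2, for which Z(lam) = <lam, c> - r |lam|.
c, r = np.array([1.0, 2.0]), 0.5

def Z_support_inf(lam):
    return lam @ c - r * np.linalg.norm(lam)

ts = np.linspace(0.0, 2.0*np.pi, 64, endpoint=False)   # discretized sphere
lams = np.stack([np.cos(ts), np.sin(ts)], axis=1)

def in_outer_approx(z):
    """True if z satisfies every cut <lam, z> >= Z(lam)."""
    return all(lam @ z >= Z_support_inf(lam) for lam in lams)

print(in_outer_approx(c))                          # the center belongs to Z
print(in_outer_approx(c + np.array([2*r, 0.0])))   # a point far outside Z
```

Refining the discretization of the sphere tightens the polyhedral outer approximation around Z.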
In the situation of expectation constraints, that is, G(m) = Ψ( ∫_{R^n} φ(x) dm(x) ) with Ψ(z) = z, the following problem can be considered:

    sup_{λ ∈ R^K} Z(λ), subject to: |λ| = 1, λ ≥ 0.   (21)

If the value of problem (21) is negative, then problem (P) is feasible; if the value of problem (21) is positive, then problem (P) is infeasible. For proving this result, one has to prove that (21) is the dual of

    inf_{z ∈ Z} sup_{λ ∈ R^K} ⟨λ, z⟩, subject to: |λ| = 1, λ ≥ 0,

and one has to observe that for all z ∈ R^K, z < 0 if and only if the following problem has a negative value: sup_{λ ∈ R^K} ⟨λ, z⟩, subject to: |λ| = 1, λ ≥ 0.
4. Optimality conditions. We prove in this section the main result: if a control u is a solution to (P ) and satisfies a qualification condition, then it is the solution to a standard problem of the form (3). Before proving our result, we recall in Subsection 4.1 some well-known properties of the value function associated with a standard problem.
4.1. Standard problems. Let φ : R^n → R be a continuous function dominated by |x|^p. Let us define:

    Φ(m) := ∫_{R^n} φ(x) dm(x), for m ∈ P_p(R^n).

The mapping Φ is linear, in so far as for all m_1 and m_2 ∈ P_p(R^n) and for all θ ∈ [0, 1],

    Φ( (1 − θ) m_1 + θ m_2 ) = (1 − θ) Φ(m_1) + θ Φ(m_2).

It is also continuous for the d_1-distance, see Lemma A.3 (in the appendix). We denote by (P(φ)) the following standard problem:

    inf_{u ∈ U_0(Y_0)} Φ(m_T^{0,Y_0,u}) = E[ φ(X_T^{0,Y_0,u}) ].   (P(φ))

Let û ∈ U_0(Y_0) and let m̂ = m_T^{0,Y_0,û}. By continuity of Φ, the control process û is a solution to (P(φ)) if and only if

    Φ(m̂) ≤ Φ(m), for all m ∈ cl(R).

We recall, for future reference, some well-known results concerning the value function associated with the standard problem (P(φ)). We refer to the textbooks [16], [28], [35] on this topic. The value function V associated with (P(φ)) is defined for all t ∈ [0, T] and for all x ∈ R^n by

    V(t, x) := inf_{u ∈ U_t^0} E[ φ(X_T^{t,x,u}) ].

It can be characterized as the unique viscosity solution to the following Hamilton-Jacobi-Bellman (HJB) equation:

    −∂_t V(t, x) − inf_{u ∈ U} H( x, u, ∇V(t, x), ∇²V(t, x) ) = 0, for (t, x) ∈ [0, T) × R^n, V(T, x) = φ(x),

where the Hamiltonian H is defined for x ∈ R^n, u ∈ U, p ∈ R^n, and Q ∈ R^{n×n} by

    H(x, u, p, Q) := ⟨b(x, u), p⟩ + (1/2) tr( a(x, u) Q ), with a = σσ^T.

If V is sufficiently smooth, one can prove with a verification argument that a control process u is a global solution to (P(φ)) if, almost surely and for almost every t ∈ [0, T],

    u_t ∈ argmin_{v ∈ U} H( X_t^{0,Y_0,u}, v, ∇V(t, X_t^{0,Y_0,u}), ∇²V(t, X_t^{0,Y_0,u}) ).

Finally, note that the value of problem (P(φ)) is given by

    ∫_{R^n} V(0, x) dm(x),

where m denotes the probability distribution of Y_0.
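The backward resolution of a standard problem by dynamic programming can be sketched numerically; the block below is a hypothetical 1-D illustration (drift-controlled dynamics with a three-point control set and φ(x) = x², chosen for this sketch) using a semi-Lagrangian scheme with a two-point quadrature of the Brownian increment.

```python
import numpy as np

# Numerical sketch of the resolution of a standard problem (P(phi)) by
# dynamic programming, on the hypothetical 1-D dynamics
# dX_s = u ds + sigma dW_s with U = {-1, 0, 1} and phi(x) = x^2.
sigma, T, nt = 0.5, 1.0, 200
x = np.linspace(-3.0, 3.0, 201)
dt = T / nt
phi = lambda y: y**2

V = phi(x)                                  # terminal condition V(T, .) = phi
U = (-1.0, 0.0, 1.0)
for _ in range(nt):
    # one backward step: V(t, x) = min_u E[ V(t + dt, x + u dt + sigma dW) ],
    # with the noise replaced by a two-point quadrature (+/- sigma sqrt(dt))
    Q = np.stack([0.5*(np.interp(x + u*dt + sigma*np.sqrt(dt), x, V)
                      + np.interp(x + u*dt - sigma*np.sqrt(dt), x, V))
                  for u in U])
    V = Q.min(axis=0)

# V(0, 0) is the value of (P(phi)) for Y_0 = 0; it cannot exceed the value of
# the constant control u = 0, which is E[(sigma W_T)^2] = sigma^2 T.
i0 = len(x) // 2
print(V[i0], sigma**2 * T)
```

Recording the argmin over u at each grid node would provide the optimal control in feedback form, as in the verification argument above.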

4.2. Main result.
In this subsection, we give first-order optimality conditions in variational form for problem (P) (defined in Subsection 2.3). Throughout this subsection, a solution ū ∈ U_0(Y_0) to problem (P) is fixed, and we set

    m̄ := m_T^{0,Y_0,ū}.

We first give a metric regularity result (Theorem 4.1), which is a key tool for the proof of the optimality conditions (Theorem 4.3). Let us consider the sets A and I of active and inactive constraints at m̄, defined by

    A := { i ∈ {1, ..., N} | G_i(m̄) = 0 } and I := { i ∈ {1, ..., N} | G_i(m̄) < 0 }.

We have A ∪ I = {1, ..., N}. We write G_A(m) := (G_i(m))_{i∈A} and N_A := card(A). The following assumption is a qualification condition.

Assumption A4. There exists m_0 ∈ cl(R) such that for all i ∈ A, DG_i(m̄)(m_0 − m̄) < 0.
The estimate (23) is basically an estimate of the distance of m_θ to cl(R_ad). Indeed, by the convexity of cl(R), the probability measure (1 − η)m_θ + ηm_0 lies in cl(R). Since G is continuous and since G( (1 − η)m_θ + ηm_0 ) < 0, the probability measure (1 − η)m_θ + ηm_0 lies in cl(R_ad). It is at a distance ηd_1(m_θ, m_0) of m_θ. The real number η ≥ 0 is of the same order as the quantity |G_A(m_θ)^+|_∞, which indicates by how much the constraints are violated.
The above inequality (as well as all those involving vectors) must be understood coordinatewise.
In the following theorem, we prove first-order optimality conditions in variational form for problem (P). In the sequel, we say that a non-negative Lagrange multiplier λ ∈ R^N satisfies the complementarity condition at m̄ if

    ⟨λ, G(m̄)⟩ = 0, that is, λ_i = 0 for all i ∈ I.   (31)

Theorem 4.3. Assume that ū is a solution to problem (P) and let m̄ = m_T^{0,Y_0,ū}. If Assumption A4 holds, then there exists a non-negative Lagrange multiplier λ ∈ R^N satisfying the complementarity condition (31) at m̄ and such that ū is a solution to the standard problem (P(φ)) with φ(x) = DL(m̄, λ, x), where L(m, λ) := F(m) + ⟨λ, G(m)⟩ denotes the Lagrangian.

Remark 3.
We say then that the control process ū satisfies the optimality conditions in variational form. The optimality of ū for the standard problem with φ(x) = DL(m̄, λ, x) is equivalent to the following variational inequality:

    DL(m̄, λ)(m − m̄) ≥ 0, for all m ∈ cl(R),   (32)

and also equivalent to the following inequality:

    DF(m̄)(m − m̄) + ⟨λ, DG(m̄)(m − m̄)⟩ ≥ 0, for all m ∈ cl(R).

In the sequel, we say that a probability measure m̄ ∈ cl(R) such that G(m̄) ≤ 0 satisfies the optimality conditions in variational form if there exists a multiplier λ ≥ 0 satisfying the complementarity condition (31) and such that (32) holds. We mention that for a solution m̄ to the relaxed problem posed over cl(R_ad), not necessarily lying in R_ad, one can extend the proof given below and demonstrate that the optimality conditions in variational form are also satisfied.

Proof of Theorem 4.3. For all y ∈ R^{N_A}, we consider the following optimization problem, denoted (LP(y)):

    V(y) := inf_{m ∈ cl(R)} DF(m̄)(m − m̄), subject to: DG_A(m̄)(m − m̄) + y ≤ 0.

Step 1. We first prove that V(0) = 0. For y = 0, m̄ is feasible (for problem (LP(y)) with y = 0), thus V(0) ≤ DF(m̄)(m̄ − m̄) = 0. Now, let m ∈ cl(R) be such that DG_A(m̄)(m − m̄) ≤ 0. Let θ̄ > 0 and C > 0 be given by Theorem 4.1. Let (θ_k)_{k∈N} be a convergent sequence with limit 0 taking values in (0, θ̄]. For all k, we set

    m_{θ_k} := (1 − θ_k) m̄ + θ_k m.

By Assumption A3, we have

    G_A(m_{θ_k}) = G_A(m̄) + θ_k DG_A(m̄)(m − m̄) + o(θ_k).

Since G_A(m̄) = 0 and DG_A(m̄)(m − m̄) ≤ 0, we have

    |G_A(m_{θ_k})^+|_∞ = o(θ_k).   (33)

By Theorem 4.1, there exists for all k ∈ N a real number η_k ∈ [0, 1], with η_k ≤ C|G_A(m_{θ_k})^+|_∞, such that the probability measure m̃_k := (1 − η_k)m_{θ_k} + η_k m_0 satisfies G(m̃_k) ≤ 0. Since cl(R) is convex (Lemma 3.1), m̃_k ∈ cl(R). Therefore, for all k, by continuity of F and G, there exists m̂_k ∈ R such that G(m̂_k) ≤ 0 and such that F(m̂_k) − F(m̃_k) ≤ θ_k². Using the differentiability assumption on F (Assumption A3), the feasibility of m̂_k, and the fact that η_k = o(θ_k) (by (33)), we obtain that

    F(m̄) ≤ F(m̂_k) ≤ F(m̃_k) + θ_k² = F(m_{θ_k}) + o(θ_k) = F(m̄) + θ_k DF(m̄)(m − m̄) + o(θ_k).

It follows that DF(m̄)(m − m̄) ≥ 0, which finally proves that V(0) = 0.
Step 2. We compute now the Legendre-Fenchel transform (see [6, Relation 2.210] for a definition) of V. For all λ_A ∈ R^{N_A}, we have

    V*(λ_A) = sup_{y ∈ R^{N_A}} [ ⟨λ_A, y⟩ − V(y) ] = sup_{y ∈ R^{N_A}} sup_{m ∈ cl(R), DG_A(m̄)(m − m̄) + y ≤ 0} [ ⟨λ_A, y⟩ − DF(m̄)(m − m̄) ].

Using the change of variable z = DG_A(m̄)(m − m̄) + y, we obtain:

    V*(λ_A) = sup_{z ≤ 0} ⟨λ_A, z⟩ + sup_{m ∈ cl(R)} [ −DF(m̄)(m − m̄) − ⟨λ_A, DG_A(m̄)(m − m̄)⟩ ],

so that V*(λ_A) = +∞ if λ_A has a negative coordinate and, otherwise,

    V*(λ_A) = − inf_{m ∈ cl(R)} [ DF(m̄)(m − m̄) + ⟨λ_A, DG_A(m̄)(m − m̄)⟩ ].

Step 3. Using the convexity of cl(R) (Lemma 3.1), one can easily show that V is a convex function. Since V(0) = 0, V is proper and subdifferentiable at 0, so that there exists λ_A ∈ R^{N_A} with V*(λ_A) = −V(0) = 0; by Step 2, λ_A ≥ 0. Extending λ_A by 0 on I, we obtain a non-negative multiplier λ satisfying the complementarity condition (31) and the variational inequality (32). The theorem is proved.
The approach which has been employed to prove Theorem 4.3 is similar to the one used in [5] for proving Pontryagin's principle in a deterministic framework. In this reference, an auxiliary convex optimization problem (denoted by (P L), introduced at the end of Section 3, and involving the final-state variable) is introduced. Pontryagin's principle is obtained by analyzing the dual of that problem, which plays a similar role as Problem (LP (y)) in the proof of Theorem 4.3.
The following lemma shows that the value of the standard problem can be used to estimate the loss of optimality of a given probability measure m̂ in R_ad (defined by (13)), when the mappings F, G_1, ..., G_N are convex. The calculations involved in the proof are rather standard, see for example [20, Theorem 12.14].
Lemma 4.4. Denote by Val(P) the value of problem (P). Assume that the mappings F, G_1, ..., G_N are convex. Then, for all m̂ ∈ R_ad and for all non-negative λ ∈ R^N such that the complementarity condition holds at m̂, the following upper estimate holds:

    F(m̂) − Val(P) ≤ − inf_{m ∈ cl(R)} DL(m̂, λ)(m − m̂).   (35)

Proof. Let m ∈ R_ad. Since F is convex, we have

    F(m) ≥ F(m̂) + DF(m̂)(m − m̂).

Denoting by A the active set at m̂ and setting G_A(m) = (G_j(m))_{j∈A} and λ_A = (λ_j)_{j∈A}, we obtain, using the feasibility of m and the convexity of G_1, ..., G_N, that

    G_A(m̂) + DG_A(m̂)(m − m̂) ≤ G_A(m) ≤ 0.

Since λ_A ≥ 0, we deduce that

    ⟨λ_A, G_A(m̂)⟩ + ⟨λ_A, DG_A(m̂)(m − m̂)⟩ ≤ 0.

By the complementarity condition, ⟨λ_A, G_A(m̂)⟩ = 0, and therefore

    F(m) ≥ F(m̂) + DL(m̂, λ)(m − m̂).

Minimizing successively both sides with respect to m ∈ R_ad, we obtain that

    Val(P) ≥ F(m̂) + inf_{m ∈ R_ad} DL(m̂, λ)(m − m̂).

Since R_ad ⊆ cl(R), we finally obtain (35), which concludes the proof.
As a corollary, we obtain that the optimality conditions in variational form are sufficient optimality conditions, in the convex case.
Corollary 1. Assume that F, G_1, ..., G_N are convex. Let û be a feasible control process satisfying the optimality conditions in variational form. Then û is a solution to (P).
Proof. In this situation, the right-hand side of inequality (35) is equal to 0, which directly proves the optimality ofû.
We finish this section with a corollary dealing with stochastic optimal control problems with an expectation constraint.
Corollary 2. Let f : R^n → R and g : R^n → R^N be two continuous functions, dominated by |x|^p. Assume that there exists u_0 ∈ U_0(Y_0) such that E[ g(X_T^{0,Y_0,u_0}) ] < 0. Then, a control u ∈ U_0(Y_0) is a solution to the following problem:

    inf_{u ∈ U_0(Y_0)} E[ f(X_T^{0,Y_0,u}) ], subject to: E[ g(X_T^{0,Y_0,u}) ] ≤ 0,   (39)

if and only if u is feasible and there exists λ ≥ 0 such that E[ g_i(X_T^{0,Y_0,u}) ] < 0 ⟹ λ_i = 0, for all i = 1, ..., N, and such that u is a solution to (P(φ)) with φ(x) = f(x) + ⟨λ, g(x)⟩.
Proof. Setting F(m) = ∫_{R^n} f(x) dm(x) and G(m) = ∫_{R^n} g(x) dm(x), we obtain that problem (39) falls into the general class of problems studied in the article. The functions F and G satisfy the required regularity assumptions (see Subsection 2.4). Note that DF(m, x) = f(x) and that DG(m, x) = g(x) (up to a constant). The existence of u_0 ensures that the qualification condition is satisfied. The mappings F and G are clearly convex; therefore, the optimality conditions in variational form are necessary and sufficient, by Theorem 4.3 and Corollary 1.

5. Numerical method and results.
5.1. Augmented Lagrangian method. We provide in this section a numerical method for solving problem (P) and give results for two academic problems. The method is an augmented Lagrangian method combined with a projected-gradient-type algorithm.
Let us begin with a rough description of the method, consisting of Algorithms 1 and 2 (page 21). The second algorithm is a building block of the first one. The augmented Lagrangian method is used to solve the following problem:

    inf_{m ∈ cl(R), s ∈ R^N, s ≥ 0} F(m), subject to: G(m) + s = 0.   (40)

The method generates a sequence (m_k, s_k, λ_k)_{k=0,1,...}; it does not directly provide a control process ū whose final distribution approximates the computed solution. The ultimate step of the algorithm (line 20) aims at recovering such a control ū by solving the standard problem (P(φ)) with φ = DL(m_k, λ_k). One has to check a posteriori that ū approximately satisfies the optimality conditions in variational form with associated Lagrange multiplier λ̄ = λ_k.
Let us go into the details of the method. The augmented Lagrangian L_A associated with (40) is given by

    L_A(m, s, λ, c) := F(m) + ⟨λ, G(m) + s⟩ + (c/2) |G(m) + s|².   (41)

The employed norm in the above definition is the Euclidean norm. Note that the constraints s ≥ 0 and m ∈ cl(R) are not dualized, since they will be ensured by the projected gradient method. The mapping L_A(·, s, λ, c) is differentiable (in the sense of Assumption A3.1) with respect to m, with

    DL_A(m, s, λ, c) = DF(m) + Σ_{j=1}^N ( λ_j + c(G_j(m) + s_j) ) DG_j(m).

A representative of DL_A(m, s, λ, c) is therefore given by

    DL_A(m, s, λ, c, x) = DF(m, x) + Σ_{j=1}^N ( λ_j + c(G_j(m) + s_j) ) DG_j(m, x),

up to a constant. The partial gradient of L_A with respect to s is given by

    ∇_s L_A(m, s, λ, c) = λ + c( G(m) + s ).

Let us first focus on Algorithm 2. It aims at solving the following problem:

    inf_{m ∈ cl(R), s ∈ R^N, s ≥ 0} L_A(m, s, λ, c).

At iteration ℓ of the algorithm, given the current iterate (m_ℓ, s_ℓ), a step length θ_ℓ ∈ [0, 1] is computed by minimizing (approximately)

    θ ∈ [0, 1] ↦ L_A( m_ℓ(θ), s_ℓ(θ), λ, c ),

where m_ℓ(θ) = (1 − θ)m_ℓ + θ m̂_ℓ, s_ℓ(θ) = max(s_ℓ + θ δs_ℓ, 0), and where δs_ℓ = −∇_s L_A(m_ℓ, s_ℓ, λ, c). The max operator in the above expression must be understood coordinatewise; it is nothing but a projection of s_ℓ + θ δs_ℓ on R^N_{≥0}. The probability distribution m̂_ℓ is chosen as a solution to

    inf_{m ∈ cl(R)} DL_A(m_ℓ, s_ℓ, λ, c)(m − m_ℓ),

which is itself a standard problem. The value of the above problem is non-positive; the measure m̂_ℓ − m_ℓ can therefore be seen as a descent direction for the variable m. The next iterate of the algorithm is given by (m_{ℓ+1}, s_{ℓ+1}) = (m_ℓ(θ_ℓ), s_ℓ(θ_ℓ)).

Algorithm 2 stops when the variable ε (defined line 6 in the algorithm) is smaller than ω, i.e. when both

    DL_A(m_ℓ, s_ℓ, λ, c)(m_ℓ − m̂_ℓ) ≤ ω   (46)

and

    |s_ℓ − max(s_ℓ + δs_ℓ, 0)|_∞ ≤ ω.   (47)

The inequality (46) ensures that the optimality condition (44) is approximately satisfied (note that the left-hand side of (46) is always non-negative). The inequality (47) implies that s_ℓ − max(s_ℓ + δs_ℓ, 0) ≥ −ω1, thus that

    s_ℓ + δs_ℓ ≤ max(s_ℓ + δs_ℓ, 0) ≤ s_ℓ + ω1,

and finally that

    δs_ℓ ≤ ω1, i.e. ∇_s L_A(m_ℓ, s_ℓ, λ, c) ≥ −ω1.   (48)

Moreover, for all i = 1, ..., N, if s_{ℓ,i} > ω, then max(s_{ℓ,i} + δs_{ℓ,i}, 0) > 0 (by (47)) and therefore s_{ℓ,i} + δs_{ℓ,i} = max(s_{ℓ,i} + δs_{ℓ,i}, 0), and finally

    |δs_{ℓ,i}| ≤ ω.   (49)

Inequalities (48) and (49) therefore ensure that the optimality condition (45) is approximately satisfied. Let us come back to Algorithm 1. It constructs a sequence (m_k)_{k=0,1,...} of probability distributions in cl(R), a sequence of non-negative slack variables (s_k)_{k=0,1,...}, and a sequence of Lagrange multipliers (λ_k)_{k=0,1,...}.
Two sequences of tolerances are also constructed, (η_k)_{k=0,1,...} and (ω_k)_{k=0,1,...}, as well as a sequence of penalty parameters (c_k)_{k=0,1,...}. At iteration k, the augmented Lagrangian is minimized with Algorithm 2, with λ = λ_k and the current tolerance ω_k. The triplet (m_{k+1}, s_{k+1}, ε_{k+1}) is the output of Algorithm 2. Three cases are then considered.
• If |G(m_{k+1}) + s_{k+1}| ≤ η_k, then we consider that the penalty term c_k is large enough. The Lagrange multiplier is updated as follows: λ_{k+1} = λ_k + c_k (G(m_{k+1}) + s_{k+1}). This update rule is motivated by (41).
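The outer case distinction can be sketched in a few lines. The first-order multiplier update and the tenfold penalty increase below are the standard augmented Lagrangian rules (the factor 10 matches the geometric ratio appearing in the convergence analysis); the names and the omitted tolerance updates are hypothetical.

```python
import numpy as np

# Hedged sketch of the outer update in Algorithm 1 (hypothetical names;
# the updates of the tolerances eta_k and omega_k are omitted).

def outer_update(lam, c, residual, eta_k, penalty_factor=10.0):
    """residual stands for G(m_{k+1}) + s_{k+1}; eta_k is the tolerance."""
    if np.linalg.norm(residual) <= eta_k:
        lam = lam + c * residual      # multiplier update, motivated by (41)
    else:
        c = penalty_factor * c        # penalty term judged too small
    return lam, c

lam1, c1 = outer_update(np.zeros(1), 1.0, np.array([0.05]), 0.1)
lam2, c2 = outer_update(np.zeros(1), 1.0, np.array([0.5]), 0.1)
```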

Remark 4.
In practice, the main difficulty in the method is the resolution of the standard problem. It consists of two phases: in a backward phase, the Hamilton-Jacobi-Bellman equation associated with the standard problem is solved (see subsection 4.1), which provides an optimal control (for the standard problem) in feedback form; in a forward phase, one then computes the associated probability distribution (m_t)_{t∈[0,T]}.
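The forward phase can be sketched as follows for a one-dimensional grid. The two-point transition model, the grid, and all parameters are illustrative assumptions made for this sketch, not the paper's exact scheme.

```python
import numpy as np

# Forward phase: once a feedback control u(t, x) is known, the law m_t
# of the state is propagated through a discretized Markov chain
# (Chapman-Kolmogorov equation). Illustrative model and parameters.

def propagate(m0, xs, u_feedback, dt, n_steps, sigma=1.0):
    m, history = m0.copy(), [m0.copy()]
    for k in range(n_steps):
        m_next = np.zeros_like(m)
        for i, x in enumerate(xs):
            drift = u_feedback(k * dt, x)
            # dX = u dt + sigma dW approximated by moving the mass at x
            # to x + u dt +/- sigma sqrt(dt), each with probability 1/2;
            # the clip below crudely keeps mass inside the domain.
            for target in (x + drift * dt + sigma * np.sqrt(dt),
                           x + drift * dt - sigma * np.sqrt(dt)):
                j = int(np.clip(np.searchsorted(xs, target), 0, len(xs) - 1))
                m_next[j] += 0.5 * m[i]
        m = m_next
        history.append(m.copy())
    return history

xs = np.linspace(-5.0, 5.0, 101)
m0 = np.zeros(101); m0[50] = 1.0               # Dirac mass at x = 0
hist = propagate(m0, xs, lambda t, x: 0.0, 1e-2, 10)
```

Each step of the recursion conserves mass, so every element of `hist` is again a probability vector.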
Algorithm 2: Projected gradient method for minimizing L_A(·, ·, λ, c) on cl(R) × ℝ^N_{≥0}.

Let us mention that we do not tackle in this subsection the issues related to discretization. In general, termination proofs for line-search methods require that the function to be minimized be differentiable with a Lipschitz-continuous gradient. A similar assumption is therefore considered below.
Assumption A5. The mappings F and G are differentiable in the sense of Assumption A3. Moreover, there exist two constants K_1 > 0 and K_2 > 0 such that for all m_1, m_2, m_3, and m_4 ∈ B̄_p(R), and such that Remark 6. It can easily be proved that, under Assumption A5, F and G are Lipschitz continuous for the distance d_1, with modulus K_2.
Before starting the convergence analysis, we give an example of a mapping satisfying Assumption A5.
Lemma 5.1. If F and G are of the form m ∈ B̄_p(R) ↦ Ψ(∫ φ dm), where Ψ is differentiable with a Lipschitz-continuous derivative on bounded sets and where φ is globally Lipschitz continuous, then Assumption A5 holds.
Proof. Recall the expression of the derivative, given in this situation by (12). Let K_a be the Lipschitz modulus of φ. For m ∈ B̄_p(R), by Hölder's inequality, Let K_c be the Lipschitz modulus of DΨ on the ball of centre 0 and radius K_b, and let K_d be a bound of |DΨ| on the same ball. Using the dual representation of the Wasserstein distance d_1, we obtain that Similarly, we also obtain that Thus, Assumption A5 holds with K_1 = K_a² K_c and K_2 = K_a K_d.
The following lemma provides some useful properties dealing with the Lipschitz continuity of the derivatives of the augmented Lagrangian.
Lemma 5.2. Under Assumption A5, for all λ ∈ ℝ^N, for all c > 0, and for all bounded sets S ⊂ ℝ^N, there exist three constants K_3, K_4, and K_5 > 0 such that for all m_1 and m_2 ∈ cl(R), for all s_1 and s_2 ∈ S,

Proof. It is clear that |∇_s L_A(m_1, s_1, λ, c)| = |λ + c(G(m_1) + s_1)| is bounded, since G is Lipschitz continuous and since cl(R) and S are bounded. The first inequality follows.
The second inequality follows from the Lipschitz continuity of G. For proving the third inequality, we focus on the Lipschitz continuity of the mapping (s, m) ↦ (G(m) + s) DG(m) (the other terms involved in DL_A(·, ·, λ, c) can be easily treated). Let K > 0 and S̄ > 0 be such that for all m_1 ∈ cl(R) and all s_1 ∈ S, |G(m_1)| ≤ K and |s_1| ≤ S̄. We have The third inequality follows.
Proof. We argue by contradiction and therefore assume that the algorithm never terminates. It then generates a sequence (m_ℓ, s_ℓ, ε_ℓ)_{ℓ∈ℕ} such that ε_ℓ > ω for all ℓ ∈ ℕ. One can easily prove that the following set is bounded: since G is bounded on cl(R) and since, for a fixed m ∈ cl(R), s ∈ ℝ^N ↦ L_A(m, s, λ, c) is linear-quadratic, with a dominant term c|s|² independent of m. In a similar way, one can prove that L_A(·, ·, λ, c) is bounded from below. By construction, the sequence (L_A(m_ℓ, s_ℓ, λ, c))_{ℓ∈ℕ} is decreasing; therefore, for all ℓ ∈ ℕ, s_ℓ ∈ S. Let K_3, K_4, and K_5 be the three constants given by Lemma 5.2 for the set S. The proof mainly consists in finding an upper estimate of the decay at a given iteration ℓ; this is achieved with estimate (53) below. Let us introduce some notation, used only in this proof. For θ ∈ [0, 1], we denote m_ℓ(θ) = (1 − θ)m_ℓ + θm̄_ℓ and s_ℓ(θ) = max(s_ℓ + θδs_ℓ, 0).
We also omit the arguments λ and c of the augmented Lagrangian (since they are fixed). Let θ ∈ [0, 1]. Observe first that, by Lemma A.1, where D is the diameter of cl(R) (defined by (10)).
We split the first term as follows:
Combining Lemma 5.2 with estimates (51) and (52), we obtain that Since s_ℓ(θ) is the orthogonal projection of s_ℓ + θδs_ℓ on ℝ^N_{≥0} and since s_ℓ = s_ℓ(0) ∈ ℝ^N_{≥0}, we have It is proved in [10, Lemma 2.2] that Combining the last two estimates, we obtain that Let us estimate (b). We have For all ℓ ∈ ℕ, we denote by ε̃_ℓ Combining the three obtained upper estimates of (a_1), (a_2), and (b), we obtain that there exists a constant K, independent of ℓ, such that for all θ ∈ [0, 1], For all ℓ ∈ ℕ, we define θ̂_ℓ = min(1, ε̃_ℓ/K).
If θ̂_ℓ = ε̃_ℓ/K, then Otherwise, θ̂_ℓ = 1 and K ≤ ε̃_ℓ, and therefore Thus, for all ℓ ∈ ℕ, Recall that the sequence (L_A(m_ℓ, s_ℓ))_{ℓ∈ℕ} is bounded from below; let L̄_A be a lower bound. We deduce from the above estimate that, for all q ∈ ℕ, The sequence (min(Kε̃_ℓ², ε̃_ℓ))_{ℓ∈ℕ} is therefore summable and thus converges to 0. It follows that (ε_ℓ)_{ℓ∈ℕ} converges to 0. Since ε_ℓ is the sum of two non-negative terms, both terms converge to 0, i.e.
Observe that the two conditions satisfied by (m̄, s̄) are the optimality conditions for the problem inf_{m ∈ cl(R), s ∈ ℝ^N} |G(m) + s|², subject to: s ≥ 0.
Proof of Proposition 2. Let us assume that the algorithm does not terminate. Let us first prove that there are infinitely many indices k such that |G(m_{k+1}) + s_{k+1}| > η_k. Suppose that this is not the case: then there exists k_0 such that for all k ≥ k_0, |G(m_{k+1}) + s_{k+1}| ≤ η_k. Considering the update formulas for η_k and ω_k used in this situation (line 12), we obtain that η_k → 0 and ω_k → 0, and thus, for some k ≥ k_0 sufficiently large, η_k ≤ η* and ω_k ≤ ω*. The algorithm necessarily terminates when these two inequalities hold, which is a contradiction.
We now prove that the sequence (s_k)_{k∈ℕ} is bounded. The following inequalities hold: The value of the augmented Lagrangian decreases along the iterations of Algorithm 2. Moreover, the pair (m_{k+1}, s_{k+1}) is obtained as an output of Algorithm 2, with initial value (m_k, s_k). Therefore, Using (56) and denoting y_k = G(m_k) + s_k, we obtain that Dividing by c_k/2, adding |λ|²/c_k² on both sides, and factorizing, we obtain that Using the inequality and finally, by induction, for all q ≥ k_1, Since the sequence (c_k)_{k≥k_1} is a geometric sequence of ratio 10, the sequences (1/c_k)_{k≥k_1} and (1/√c_k)_{k≥k_1} are also geometric, with ratios 1/10 and 1/√10, respectively. These last two sequences are therefore summable and we deduce from (58) that (y_k)_{k∈ℕ} is bounded. Since (G(m_k))_{k∈ℕ} is bounded, we finally obtain that (s_k)_{k∈ℕ} is bounded.
Since cl(R) is compact and (s_k)_{k∈ℕ} is bounded, the sequence (m_k, s_k)_{k∈ℕ} possesses at least one accumulation point. Let (m̄, s̄) ∈ cl(R) × ℝ^N_{≥0} be such an accumulation point. To simplify, we assume that the whole sequence (m_k, s_k)_{k∈ℕ} converges to (m̄, s̄); the arguments which follow are easily adapted if only a subsequence converges to (m̄, s̄). Let m ∈ cl(R). For all k ∈ ℕ, we have Dividing by c_{k−1} and passing to the limit, we obtain that ⟨G(m̄) + s̄, DG(m̄)(m − m̄)⟩ ≥ 0.
Minimizing the left-hand side with respect to m, we obtain (54). It remains to prove (55). For all k ≥ k_1, we have Dividing by c_{k−1} and passing to the limit, we obtain that max(−(G(m̄) + s̄), 0) = 0, which proves that G(m̄) + s̄ ≥ 0. Let i ∈ {1, ..., N} be such that s̄_i > 0. For k large enough, ω_{k−1} < s_{k,i} and therefore, as a consequence of (59), Dividing by c_{k−1} and passing to the limit, we obtain that G_i(m̄) + s̄_i ≤ 0. Since G(m̄) + s̄ ≥ 0, we deduce that G_i(m̄) + s̄_i = 0, which proves that (55) holds.

5.3.
Results. We present numerical results for two academic problems. The SDE under consideration is the same for both of them: One can only act on the drift; the volatility is constant, equal to 1. All standard problems are solved by dynamic programming. The corresponding HJB equation is discretized with a semi-Lagrangian scheme (see [11]), which consists in approximating the SDE by a controlled Markov chain, defined at times {0, δt, ..., T} with δt = 10⁻², taking values in {−5, −5+δx, ..., 5−δx, 5} with δx = 10⁻³, and using reflecting boundary conditions. At any mesh point, the minimization of the Hamiltonian is realized by enumeration over a discretized set of controls {−2, −2+δu, ..., 2} with δu = 10⁻¹. As mentioned in Remark 4, the resolution of standard problems (such as the one at line 4 of Algorithm 2) is done in two phases. Once an optimal control u has been found by dynamic programming, the corresponding probability distribution is obtained by solving the Chapman-Kolmogorov equation associated with the discretized Markov chain. The minimization with respect to θ involved in the computation of a step length (line 10, Algorithm 2) is done by enumeration over the discretized set {0, δθ, 2δθ, ..., 1} with δθ = 10⁻⁶. Note that this step of the method is computationally inexpensive (at least for the considered test cases).
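The backward phase can be sketched in the spirit of the semi-Lagrangian scheme just described: the HJB equation is discretized through a controlled Markov chain and, at each mesh point, the Hamiltonian is minimized by enumeration over a discretized control set. The two-point chain, the terminal cost, and the coarse parameters below are simplifying assumptions for this sketch.

```python
import numpy as np

# Minimal sketch of backward dynamic programming for a standard problem,
# with minimization by enumeration over a control grid. Illustrative.

def solve_hjb(xs, us, terminal_cost, dt, n_steps, sigma=1.0):
    V = np.asarray(terminal_cost(xs), dtype=float)    # value at time T
    policy = np.zeros((n_steps, len(xs)))
    sq = sigma * np.sqrt(dt)
    for k in reversed(range(n_steps)):
        V_new = np.empty_like(V)
        for i, x in enumerate(xs):
            best_val, best_u = np.inf, us[0]
            for u in us:                              # enumeration over controls
                # chain step: x -> x + u dt +/- sigma sqrt(dt), clipped
                # at the boundary (a crude reflecting condition)
                xp = np.clip(x + u * dt + sq, xs[0], xs[-1])
                xm = np.clip(x + u * dt - sq, xs[0], xs[-1])
                val = 0.5 * (np.interp(xp, xs, V) + np.interp(xm, xs, V))
                if val < best_val:
                    best_val, best_u = val, u
            V_new[i], policy[k, i] = best_val, best_u
        V = V_new
    return V, policy

# With terminal cost x (expectation of the final state), the optimal
# control should push the state down everywhere, i.e. take the value -2.
xs = np.linspace(-2.0, 2.0, 41)
V, policy = solve_hjb(xs, np.array([-2.0, 0.0, 2.0]), lambda x: x, 0.1, 3)
```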
Since the SDE (60) is linear with respect to u, the Hamiltonian H is itself linear with respect to u, and one can expect optimal controls to take only the boundary values −2 and 2, according to whether the derivative (with respect to x) of the solution to the HJB equation is positive or negative. The optimal controls obtained below indeed take these values at most mesh points; this is why we worked with a rather coarse discretization of U.

Test case 1: bounded variance. For the first test case, we consider the following cost function and constraint: where α = 0.4. Observe that F(m_T^{0,Y_0,u}) = E[X_T^{0,Y_0,u}]: the cost function is the expectation of the final state. The mapping G(m_T^{0,Y_0,u}) + α is the variance of X_T^{0,Y_0,u}. The cost function F is linear and G is of the form (11). The derivative of G at any m, obtained with (12), is a linear-quadratic function (with respect to x). Convergence results are shown in Figure 1. The tolerances η* and ω* are chosen equal, for values ranging from 10⁻³ to 10⁻⁶. For each of these tolerances, the values of G(ū) and λ̄ are provided. The column "Var. Ineq." contains the following value: which somehow indicates to what extent the variational inequality is satisfied. The column c shows the value of the penalty parameter at the last iteration. The last column shows the total number of standard problems which have to be solved.
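For the record, the linear-quadratic derivative of G mentioned above can be made explicit. Writing the variance constraint as G(m) = ∫ x² dm(x) − (∫ x dm(x))² − α (a hedged reading of the definitions above), formula (12), applied with Ψ(a, b) = b − a² − α and φ(x) = (x, x²), yields, up to a constant, the representative:

```latex
DG(m)\colon x \longmapsto x^2 - 2\mu x,
\qquad \text{where } \mu = \int_{\mathbb{R}} x \, \mathrm{d}m(x),
```

which is indeed linear-quadratic with respect to x.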
It can first be observed that for tolerances below 10⁻⁴, the variational inequality is almost satisfied. The violation of the constraint is small, of the same order as the tolerances. The obtained Lagrange multipliers converge as the tolerance goes to 0. We also observe that the mechanism of Algorithm 1 prevents the penalty term c from becoming very high for small tolerances. The optimal control generated by the algorithm is shown in Figure 2a (page 30), and the associated probability measure (at any time t) in Figure 2b. The value function associated with the standard problem with cost function DL(m_T^{0,Y_0,ū}, λ̄, ·) is represented in Figure 2c; we recall that it plays the role of an adjoint equation.
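The threshold feedback obtained for this test case can be checked qualitatively by a Monte-Carlo simulation of the controlled SDE. The threshold −1.6 is taken from the results discussed below; the horizon T = 1 and the initial condition X_0 = 0 are assumptions made only for this sketch.

```python
import numpy as np

# Euler-Maruyama simulation of the SDE under the bang-bang feedback
# u(x) = 2 for x <= -1.6 and u(x) = -2 otherwise. Horizon and initial
# condition are illustrative assumptions.

def simulate(n_paths=20000, T=1.0, dt=1e-2, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0)
    for _ in range(int(T / dt)):
        u = np.where(x <= -1.6, 2.0, -2.0)             # threshold feedback
        x = x + u * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
        x = np.clip(x, -5.0, 5.0)                      # crude reflecting boundary
    return x

samples = simulate()
# The feedback pushes the state down toward -1.6 and then holds it
# there, lowering the mean while keeping the variance limited.
```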
As expected, the optimal control has a kind of bang-bang structure. It is constant with respect to time, equal to −2 for x ≥ −1.6 and to 2 for x ≤ −1.6. If the same problem were solved without the constraint, the optimal control would be equal to −2 everywhere, in order to minimize the expectation of the final state. Here, the optimal control must be equal to 2 when x is smaller than −1.6 in order to keep the variance sufficiently small and to satisfy the constraint.

Test case 2: expectation constraint. For this second test case, we consider the following cost function and constraint: with α = 0.4. Note that both F and G are linear. Roughly speaking, the constraint G(m) ≤ 0 ensures that a proportion α of the final probability measure remains around 0. Convergence results are given in Figure 3. The value of G(ū) converges to 0, suggesting that the constraint is active for the undiscretized problem. Convergence of the Lagrange multiplier is observed. The variational inequality is exactly satisfied, since the derivatives of F and G do not depend on m. The value of the penalty parameter does not increase much. The optimal control, the probability distribution (at any time), and the adjoint are provided in Figures 2d, 2e, and 2f (page 30). As can be observed, the optimal control only takes the boundary values. The value 2 is taken in a small region around x = 0 after t ≈ 0.4, which guarantees that a sufficiently large proportion of the distribution remains located around 0, as can be seen on the graph of the probability distribution.

6. Conclusion. We have proved optimality conditions for a class of constrained non-linear stochastic optimal control problems, using an appropriate concept of differentiability for the cost function and the constraints. The convexity of the closure of the reachable set of probability measures plays an essential role in the proof of these results.
An augmented Lagrangian method, based on the convexity property and on the optimality conditions, has been proposed, demonstrating the relevance of these properties. Good convergence results have been obtained for examples with a one-dimensional state variable. Future work will focus on the extension of these results to more general problems, for example to cost functions containing an integral cost depending on the current probability distribution.
Appendix A. Elements of optimal transportation. Wasserstein distance. Let us recall the definition of the Wasserstein distance, denoted by d_1 in this article. For all m_1 and m_2 in P_1(ℝⁿ), d_1(m_1, m_2) = inf_{π∈Π(m_1,m_2)} ∫_{ℝⁿ×ℝⁿ} |y − x| dπ(x, y), the set Π(m_1, m_2) being the set of transport plans from m_1 to m_2, defined as {π ∈ P(ℝ²ⁿ) | π(A × ℝⁿ) = m_1(A), π(ℝⁿ × A) = m_2(A), for all measurable A ⊂ ℝⁿ}. The last inequality follows from the dual representation (4). Maximizing the left-hand side with respect to φ ∈ 1-Lip(ℝⁿ), we obtain inequality (64).
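On the real line, d_1 also admits the closed form d_1(m_1, m_2) = ∫ |F_1(x) − F_2(x)| dx in terms of cumulative distribution functions. For two empirical measures with the same number of atoms, the optimal plan is monotone, which gives a one-line computation; this sketch is only an illustration, not used in the article.

```python
import numpy as np

# Wasserstein-1 distance between two empirical measures on R with the
# same number of atoms: sort both samples and pair them monotonically.

def w1_empirical(x, y):
    """d_1 between two equally weighted empirical measures on R."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

d_shift = w1_empirical(np.array([0.0, 1.0, 2.0]),
                       np.array([0.5, 1.5, 2.5]))     # translation by 0.5
```

A translation of a measure by a constant c has d_1-distance exactly |c|, which the sorted-sample formula reproduces.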
A compactness property.
Lemma A.2. For all p > 1 and R ≥ 0, the subset B̄_p(R) of P_1(ℝⁿ) (defined in (5)) is compact for the d_1-distance.
Proof. We first prove that B̄_p(R) is compact for the weak topology of P(ℝⁿ). For all r ≥ 0 and for all m ∈ B̄_p(R), r^p m(B_r^c) ≤ ∫_{B_r^c} |x|^p dm(x) ≤ R, and thus m(B_r^c) ≤ R/r^p → 0 as r → ∞, meaning that B̄_p(R) is tight. By Prokhorov's theorem [34, Page 43], B̄_p(R) is therefore precompact for the weak topology. Now, let (m_k)_{k∈ℕ} be a sequence in B̄_p(R) weakly converging to m̄ ∈ P(ℝⁿ). For all r ≥ 0, the function min(|x|^p, r) is continuous and bounded, thus: ∫_{ℝⁿ} min(|x|^p, r) dm̄(x) = lim_{k→∞} ∫_{ℝⁿ} min(|x|^p, r) dm_k(x) ≤ R.
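The tightness bound above is nothing but Markov's inequality applied to |x|^p. A quick numerical sanity check, on an illustrative sample distribution (standard normal, p = 2, so that R approximates E|X|² = 1):

```python
import numpy as np

# Empirical check of the Markov bound m(B_r^c) <= R / r^p on a
# standard normal sample (illustrative choice of distribution).

rng = np.random.default_rng(1)
sample = rng.standard_normal(200_000)
p = 2
R = np.mean(np.abs(sample) ** p)                   # empirical moment bound
tails = {r: np.mean(np.abs(sample) > r) for r in (1.0, 2.0, 3.0)}
bounds = {r: R / r ** p for r in (1.0, 2.0, 3.0)}  # Markov bounds R / r^p
```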
We obtain, using the monotone convergence theorem: ∫_{ℝⁿ} |x|^p dm̄(x) = lim_{r→∞} ∫_{ℝⁿ} min(|x|^p, r) dm̄(x) ≤ R, thus m̄ ∈ B̄_p(R). Therefore, B̄_p(R) is weakly closed, and thus weakly compact.
Finally, we need to prove that any sequence (m_k)_{k∈ℕ} in B̄_p(R) converging weakly to some m̄ ∈ B̄_p(R) also converges for the d_1-distance. By [34, Definition 6.8/Theorem 6.9], it suffices to prove that Observe that for all r ≥ 0 and for all m ∈ B̄_p(R), similarly to (65), we find that We obtain (66) by letting r tend to +∞.
A continuity property. We prove in the following lemma the continuity of linear mappings for the d_1-distance on B̄_p(R) under a growth condition.
Then, for all R ≥ 0, the following mapping is continuous for the d_1-distance: m ∈ B̄_p(R) ↦ ∫_{ℝⁿ} φ(x) dm(x).
Proof. Let (m_k)_{k∈ℕ} be a sequence in B̄_p(R) converging to m̄ ∈ B̄_p(R) for the d_1-distance. Let ε > 0 and let r be such that (6) holds. We define φ̃ : x ∈ ℝⁿ ↦ φ̃(x) = φ(P(x)), where P is the orthogonal projection on B̄_r. For all x ∈ ℝⁿ, if x ∈ B̄_r, then φ̃(x) = φ(x), and if x ∈ B̄_r^c, then |φ̃(x)| = |φ(rx/|x|)| ≤ εr^p ≤ ε|x|^p. Thus, for all m ∈ B̄_p(R), By [34, Definition 6.8/Theorem 6.9], convergence for the d_1-distance implies weak convergence; thus, since φ̃ is continuous and bounded, we obtain: The result follows by letting ε tend to 0.