\`x^2+y_1+z_12^34\`
Advanced Search
Article Contents
Article Contents

Recent developments in machine learning methods for stochastic control and games

  • *Corresponding author: Mathieu Laurière

The first author was partially supported by the NSF grant DMS-1953035 and a grant from the Simons Foundation (MP-TSM-00002783).

Abstract
  • Stochastic optimal control and games have a wide range of applications, from finance and economics to social sciences, robotics, and energy management. Many real-world applications involve complex models that have driven the development of sophisticated numerical methods. Recently, computational methods based on machine learning have been developed for solving stochastic control problems and games. In this review, we focus on deep learning methods that have unlocked the possibility of solving such problems, even in high dimensions or when the structure is very complex, beyond what traditional numerical methods can achieve. We consider mostly the continuous-time and continuous-space setting. Many of the new approaches build on recent neural-network-based methods for solving high-dimensional partial differential equations or backward stochastic differential equations, or on model-free reinforcement learning for Markov decision processes, which has led to breakthrough results. This paper provides an introduction to these methods and summarizes state-of-the-art work at the crossroads of machine learning and stochastic control and games.

    Mathematics Subject Classification: Primary: 49N70, 49N80, 68T07.

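
Many of the approaches illustrated in the figures below descend from the deep BSDE method of [118], which learns the initial value $ Y_0 $ and the maps $ x \mapsto Z_{t_n} $ of a discretized BSDE by simulating the pair $ (X, Y) $ forward and penalizing the terminal mismatch $ |Y_T - g(X_T)|^2 $. The following is a minimal sketch in PyTorch, under toy assumptions (zero driver, constant volatility, terminal condition $ g(x) = \|x\|^2 $, small networks) that are not the setup of any figure or paper cited below:

```python
# A minimal sketch, under toy assumptions, of the deep BSDE method [118]:
# learn Y_0 and the maps x -> Z_{t_n}, simulate (X, Y) forward with Euler
# steps, and penalize the terminal mismatch |Y_T - g(X_T)|^2. The driver is
# f = 0, dX_t = sigma dW_t and g(x) = ||x||^2, all chosen for illustration.
import torch

d, N, T, batch, sigma = 10, 20, 1.0, 256, 1.0
dt = T / N

y0 = torch.nn.Parameter(torch.zeros(1))          # Y_0, learned by shooting
z_nets = torch.nn.ModuleList([                   # one small net per step for Z_{t_n}
    torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, d))
    for _ in range(N)
])
opt = torch.optim.Adam([y0, *z_nets.parameters()], lr=1e-2)

def g(x):                                        # terminal condition g(X_T)
    return (x ** 2).sum(dim=1)

for step in range(2000):
    x = torch.zeros(batch, d)                    # X_0 = 0
    y = y0.expand(batch)
    for n in range(N):
        dw = torch.randn(batch, d) * dt ** 0.5
        y = y + (z_nets[n](x) * dw).sum(dim=1)   # Euler step of the BSDE (f = 0)
        x = x + sigma * dw                       # Euler step of the forward SDE
    loss = ((y - g(x)) ** 2).mean()              # terminal mismatch
    opt.zero_grad(); loss.backward(); opt.step()
```

In this toy case $ Y_0 = \mathbb{E}[g(X_T)] = \sigma^2 d\, T $, so the learned `y0` can be checked against a closed-form value; the same forward-shooting structure, with nonzero drivers and coupled dynamics, underlies several of the solvers discussed below (see, e.g., the remark on terminal conditions in the caption of Figure 14).
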
  • Figure 1.  A case study of the COVID-19 pandemic in three states: New York (NY), New Jersey (NJ), and Pennsylvania (PA) in [294]. Plots of optimal policies (top-left), Susceptibles (top-right), Exposed (bottom-left), and Infectious (bottom-right) for the three states: New York (blue), New Jersey (orange), and Pennsylvania (green). A large $ \ell $ indicates a high-intensity lockdown policy. Parameter choices are given in [294, Section 4.2]

    Figure 2.  The illustrative linear quadratic model in Section 1.2. Panels (a) and (b) give three trajectories of $ X_t $, $ m_t = \mathbb{E}[X_t \vert \mathcal{F}_t^{W^0}] $ (solid lines) and their approximations $ \widehat X_t $ (dashed lines) using different realizations of $ (X_0, W, W^0) $ from validation data. Panel (c) shows the minimized cost computed using validation data over fictitious play iterations. Parameter choices are given in [243, Section 5]

    Figure 3.  The linear-quadratic regulator problem with delay in Section 2.6.1. Left: training curves of the two models in the linear-quadratic example. Right: the effect of the lag time $ \bar{\delta} $ processed by the feedforward model in the linear-quadratic example. The lag time $ \delta $ in the actual system is $ 1 $

    Figure 4.  The linear-quadratic regulator problem with delay in Section 2.6.1. A sample path of the first 5 dimensions of the state $ X_t $ and control $ \alpha_t $ obtained from the LSTM (top) and FNN (bottom) models. Left: the optimal state process discretized from the analytical solution $ (X_t)_i $ (solid lines) and its approximation $ (\hat{X}_t)_i $ (dashed lines) provided by the approximating control, under the same realized path of Brownian motion. Right: comparisons of the optimal control $ ( \alpha_t)_i $ (solid lines) and $ (\hat{ \alpha}_t)_i $ (dashed lines)
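
Figures 3 and 4 contrast a feedforward policy that sees only a finite window of lagged states with a recurrent (LSTM) policy that carries a hidden state along the path. A minimal sketch of the two parametrizations, with all shapes and layer sizes being assumptions rather than the paper's choices:

```python
# A hedged sketch of the two policy architectures compared in Figures 3-4:
# an FNN acting on a flattened window of lagged states, versus an LSTM
# carrying a hidden state along the path. All sizes are assumptions.
import torch

d_state, d_ctrl, window = 5, 5, 4

fnn = torch.nn.Sequential(                  # input: flattened lagged window
    torch.nn.Linear(d_state * window, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, d_ctrl),
)
lstm = torch.nn.LSTM(d_state, 64, batch_first=True)
head = torch.nn.Linear(64, d_ctrl)

x_path = torch.randn(8, window, d_state)    # a batch of state-path segments
alpha_fnn = fnn(x_path.flatten(1))          # control from the lagged window
h, _ = lstm(x_path)                         # hidden states along the path
alpha_lstm = head(h[:, -1])                 # control from the last hidden state
```

The recurrent policy can, in principle, summarize an arbitrarily long history in its hidden state, whereas the feedforward policy is tied to its fixed lag window; this is the distinction probed by varying $ \bar{\delta} $ in Figure 3.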

    Figure 5.  Price impact MFC example in Section 2.6.2 solved by direct method. Left: Control learned (dots) and exact solution (lines). Right: associated empirical state distribution. Here we take $ \gamma = 0.2 $ in (24)

    Figure 6.  Price impact MFC example in Section 2.6.2 solved by direct method. Left: Control learned (dots) and exact solution (lines). Right: associated empirical state distribution. Here, $ \gamma = 1 $ in (24)

    Figure 7.  Comparisons of cost functions and optimal trajectories for $ N = 24 $ players in the linear quadratic systemic risk problem in Section 3.1.2. Left: the maximum relative errors of the cost functions for the 24 players; Right: for the sake of clarity, the comparison of optimal trajectories is only presented for the $ 1^\text{st} $, $ 4^\text{th} $, $ 7^\text{th} $, $ 10^\text{th} $, $ 13^\text{th} $, $ 16^\text{th} $, $ 19^\text{th} $ and $ 22^\text{nd} $ players, where the solid lines are given by the closed-form solution and the stars are computed by deep fictitious play

    Figure 8.  Comparisons of trajectories for $ N = 24 $ players in the linear quadratic game in Section 3.1.2 using the learned equilibrium strategy profile. For the sake of clarity, we only show the mean (blue triangles) and standard deviation (red bars) of the trajectory errors for the $ 1^\text{st} $, $ 4^\text{th} $, $ 7^\text{th} $, $ 10^\text{th} $, $ 11^\text{th} $, $ 13^\text{th} $, $ 16^\text{th} $, $ 19^\text{th} $ and $ 22^\text{nd} $ players, respectively. The results are based on a total of $ 65536 $ sample paths. They show that deep fictitious play achieves a relatively uniform accuracy across players

    Figure 9.  Comparisons of optimal controls for $ N = 24 $ players in the linear quadratic game in Section 3.1.2. For clarity, we only show two sample paths of optimal controls for the $ 1^\text{st} $, $ 4^\text{th} $, $ 7^\text{th} $, $ 10^\text{th} $, $ 11^\text{th} $, $ 13^\text{th} $, $ 16^\text{th} $, $ 19^\text{th} $ and $ 22^\text{nd} $ players, respectively. The solid lines are the optimal controls given by the closed-form solution, and the dash-dotted lines are computed by deep fictitious play

    Figure 10.  Linear-quadratic systemic risk example in Section 3.1.3. The relative squared errors of $ u^i $ (left) and $ \nabla u^i $ (right) along the training process of deep fictitious play for the inter-bank game. The relative squared errors of $ u^i(0, \check {\mathit{\boldsymbol{X}}}_0^{i}) $ and $ \{\nabla u^i(t_n, \check {\mathit{\boldsymbol{X}}}_n^{i})\}_{n = 0}^{N_T-1} $ are evaluated

    Figure 11.  Linear-quadratic systemic risk example in Section 3.1.3. A sample path for each player of the inter-bank game with $ N = 10 $. Top: the optimal state process $ X_t^i $ (solid lines) and its approximation $ \hat{X}_t^i $ (circles) provided by the optimized neural networks, under the same realized path of Brownian motion. Bottom: comparisons of the strategies $ \alpha_t^i $ (solid lines) and $ \hat{\alpha}_t^i $ (dashed lines)

    Figure 12.  Flowchart of one iteration of the Sig-DFP algorithm. Input: idiosyncratic noise $ W $, common noise $ W^0 $, initial position $ X_0 $ and vector $ \hat{\nu}^{(\mathtt{k}-1)} $ from the last iteration. Output: vector $ \hat{\nu}^{(\mathtt{k})} $ for the next iteration
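
In code, the averaging step that this flowchart performs can be sketched as follows; `best_response` is a hypothetical stand-in for the inner learning step of Sig-DFP (fitting the control given the current estimate of the conditional mean, using signatures of $ (W^0, \hat{\nu}) $), and the toy map is there only to make the snippet runnable:

```python
# A minimal sketch of fictitious play averaging, as in the Sig-DFP iteration:
# the mean-field estimate nu_hat^(k) is the running average of best responses.
# `best_response` is a hypothetical stand-in for the inner learning step.
import numpy as np

def fictitious_play(best_response, nu_hat, num_iters):
    for k in range(1, num_iters + 1):
        nu_k = best_response(nu_hat)            # best response to the current average
        nu_hat = ((k - 1) * nu_hat + nu_k) / k  # nu_hat^(k), the running average
    return nu_hat

# Toy usage: the map below has fixed point 0.5, which plays the role of
# the equilibrium mean-field term; fictitious play converges to it.
print(fictitious_play(lambda nu: 0.5 * (nu + 0.5), np.zeros(3), 200))
```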

    Figure 13.  MFG of optimal consumption and investment in Section 3.2.2. Panels (a) and (b) give three trajectories of $ X_t $ and $ m_t = \exp\bigl( \mathbb{E}(\log X_t | \mathcal{F}^0_t)\bigr) $ (solid lines) and their approximations $ \hat X_t $ and $ \hat m_t $ (dashed lines) using different $ (X_0, W, W^0) $ from validation data. Panel (c) shows the maximized utility computed using validation data over fictitious play iterations. Parameter choices are: $ \delta \sim U(2, 2.5), b \sim U(0.25, 0.35), \sigma\sim U(0.2, 0.4), \theta, \xi \sim U(0, 1), \sigma^0\sim U(0.2, 0.4) $, $ \epsilon\sim U(0.5, 1) $

    Figure 14.  Systemic risk MFG example solved by the algorithm described in Section 3.2.3. Left: three sample trajectories of $ X $ using the neural network approach ('Deep Solver' with full lines, in cyan, blue, and green) or using the analytical formula ('benchmark' with dashed lines, in orange, red and purple). Right: three sample trajectories of $ Y $ (similar labels and colors). Note that the analytical formula satisfies the true terminal condition, whereas the solution computed by neural networks satisfies it only approximately since the trajectories are generated in a forward way starting from the learned initial condition

    Figure 15.  Trade crowding MFG example in Section 3.2.4 solved by DGM. Evolution of the distribution $ m $. Left: surface with the horizontal axes representing time and space and the vertical axis representing the value of the density. Right: contour plot of the density with a dashed red line corresponding to the mean of the density computed by the semi-explicit formula

    Figure 16.  Trade crowding MFG example in Section 3.2.4 solved by DGM. Each plot corresponds to the control at a different time step: Optimal control $ \alpha^* $ (dashed line) and learned control (full line)

    Figure 17.  MFG Cybersecurity example in Section 3.2.4. Test case 1: Evolution of the distribution $ m^{m_0} $ (left) and the value function $ u^{m_0} $ and $ \mathcal{U}(\cdot, \cdot, m^{m_0}(\cdot)) $ (right) for $ m_0 = (1/4, 1/4, 1/4, 1/4) $. First published in [223] by the American Mathematical Society

    Figure 18.  MFG Cybersecurity example in Section 3.2.4. Test case 2: Evolution of the distribution $ m^{m_0} $ (left) and the value function $ u^{m_0} $ and $ \mathcal{U}(\cdot, \cdot, m^{m_0}(\cdot)) $ (right) for $ m_0 = (1, 0, 0, 0) $. First published in [223] by the American Mathematical Society

    Figure 19.  MFG Cybersecurity example in Section 3.2.4. Test case 3: Evolution of the distribution $ m^{m_0} $ (left) and the value function $ u^{m_0} $ and $ \mathcal{U}(\cdot, \cdot, m^{m_0}(\cdot)) $ (right) for $ m_0 = (0, 0, 0, 1) $. First published in [223] by the American Mathematical Society

    Figure 20.  Cybersecurity MFC model solved with DDPG in Section 4.1.2: Evolution of the population distribution for five initial distributions

    Figure 21.  MFG described in Section 4.2.2, solved with fictitious play and DDPG. Left: $ L^2 $ error with respect to the analytical control; right: stationary distribution. Results obtained after 125 iterations of fictitious play

  • [1] B. Acciaio, J. Backhoff-Veraguas and R. Carmona, Extended mean field control problems: stochastic maximum principle and transport perspective, SIAM Journal on Control and Optimization, 57 (2019), 3666-3693. doi: 10.1137/18M1196479.
    [2] Y. Achdou, F. Camilli and I. Capuzzo-Dolcetta, Mean field games: numerical methods for the planning problem, SIAM Journal on Control and Optimization, 50 (2012), 77-109. doi: 10.1137/100790069.
    [3] Y. Achdou, F. Camilli and I. Capuzzo-Dolcetta, Mean field games: convergence of a finite difference method, SIAM J. Numer. Anal., 51 (2013), 2585-2612. doi: 10.1137/120882421.
    [4] Y. Achdou and I. Capuzzo-Dolcetta, Mean field games: numerical methods, SIAM J. Numer. Anal., 48 (2010), 1136-1162.  doi: 10.1137/090758477.
    [5] Y. Achdou and Z. Kobeissi, Mean field games of controls: Finite difference approximations, Mathematics in Engineering, 3 (2021), 1-35.  doi: 10.3934/mine.2021024.
    [6] Y. Achdou and M. Laurière, On the system of partial differential equations arising in mean field type control, Discrete & Continuous Dynamical Systems, 35 (2015), 3879.  doi: 10.3934/dcds.2015.35.3879.
    [7] Y. Achdou and M. Laurière, Mean Field Type Control with Congestion (II): An augmented Lagrangian method, Appl. Math. Optim., 74 (2016), 535-578.  doi: 10.1007/s00245-016-9391-z.
    [8] Y. Achdou and M. Laurière, Mean field games and applications: Numerical aspects, Mean Field Games, (2020), 249-307. doi: 10.1007/978-3-030-59837-2_4.
    [9] Y. Achdou, M. Laurière and P.-L. Lions, Optimal control of conditioned processes with feedback controls, Journal de Mathématiques Pures et Appliquées, 2020. doi: 10.1016/j.matpur.2020.07.014.
    [10] N. Agram, A. Bakdi and B. Øksendal, Deep learning and stochastic mean-field control for a neural network model, SSRN preprint ssrn.3683722, 2020. doi: 10.2139/ssrn.3639022.
    [11] A. Al-Aradi, A. Correia, G. Jardim, D. de Freitas Naiff and Y. Saporito, Extensions of the deep Galerkin method, Applied Mathematics and Computation, 430 (2022), 127287. doi: 10.1016/j.amc.2022.127287.
    [12] N. Almulla, R. Ferreira and D. Gomes, Two numerical approaches to stationary mean-field games, Dynamic Games and Applications, 7 (2017), 657-682. doi: 10.1007/s13235-016-0203-5.
    [13] B. Anahtarci, C. D. Kariksiz and N. Saldi, Q-learning in regularized mean-field games, Dynamic Games and Applications, 13 (2023), 89-117. doi: 10.1007/s13235-022-00450-2.
    [14] D. Andersson and B. Djehiche, A maximum principle for SDEs of mean-field type, Appl. Math. Optim., 63 (2011), 341-356.  doi: 10.1007/s00245-010-9123-8.
    [15] R. Andreev, Preconditioning the augmented Lagrangian method for instationary mean field games with diffusion, SIAM J. Sci. Comput., 39 (2017), A2763-A2783. doi: 10.1137/16M1072346.
    [16] A. Angiuli, J.-P. Fouque and M. Laurière, Unified reinforcement Q-learning for mean field game and control problems, to appear in Mathematics of Control, Signals, and Systems (MCSS), arXiv: 2006.13912, 2020.
    [17] A. Angiuli, C. V. Graves, H. Li, J.-F. Chassagneux, F. Delarue and R. Carmona, CEMRACS 2017: numerical probabilistic approach to MFG, ESAIM: Proceedings and Surveys, 65 (2019), 84-113. doi: 10.1051/proc/201965084.
    [18] P. K. Asea and P. J. Zak, Time-to-build and cycles, Journal of Economic Dynamics and Control, 23 (1999), 1155-1175.  doi: 10.1016/S0165-1889(98)00052-9.
    [19] M. Assouli and B. Missaoui, Deep learning for Mean Field Games with non-separable Hamiltonians, Chaos, Solitons & Fractals, 174 (2023), 113802.  doi: 10.1016/j.chaos.2023.113802.
    [20] A. Aurell, R. Carmona, G. Dayanıklı, and M. Laurière, Finite state graphon games with applications to epidemics, Dynamic Games and Applications, (2022), 1-33. doi: 10.1007/s13235-021-00410-2.
    [21] A. Aurell, R. Carmona, G. Dayanikli and M. Laurière, Optimal incentives to mitigate epidemics: a Stackelberg mean field game approach, SIAM Journal on Control and Optimization, 60 (2022), S294-S322. doi: 10.1137/20M1377862.
    [22] A. Bachouch, C. Huré, N. Langrené, and H. Pham, Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications, Methodology and Computing in Applied Probability, (2021), 1-36. doi: 10.1007/s11009-019-09767-9.
    [23] A. Balata, C. Huré, M. Laurière, H. Pham and I. Pimentel, A class of finite-dimensional numerically solvable McKean-Vlasov control problems, ESAIM: Proceedings and Surveys, 65 (2019), 114-144. doi: 10.1051/proc/201965114.
    [24] V. Bally and G. Pages, Error analysis of the optimal quantization algorithm for obstacle problems, Stochastic Processes and Their Applications, 106 (2003), 1-40.  doi: 10.1016/S0304-4149(03)00026-7.
    [25] V. Bally, G. Pages and J. Printems, A stochastic quantization method for nonlinear problems, Monte Carlo Methods and Applications, 7 (2001), 21-34. doi: 10.1515/mcma.2001.7.1-2.21.
    [26] M. Bardi, M. Falcone and P. Soravia, Numerical methods for pursuit-evasion games via viscosity solutions, in Stochastic and Differential Games: Theory and Numerical Methods, Springer, (1999), 105-175. doi: 10.1007/978-1-4612-1592-9_3.
    [27] M. Bardi, T. Raghavan and T. Parthasarathy, Stochastic and differential games: theory and numerical methods, volume 4, Springer Science & Business Media, 1999. doi: 10.1007/978-1-4612-1592-9.
    [28] M. Barnett, W. Brock, L. P. Hansen, R. Hu and J. Huang, A deep learning analysis of climate change, innovation, and uncertainty, arXiv: 2310.13200, 2023. doi: 10.2139/ssrn.4607233.
    [29] C. Barrera-Esteve, F. Bergeret, C. Dossal, E. Gobet, A. Meziou, R. Munos and D. Reboul-Salze, Numerical methods for the pricing of swing options: a stochastic control approach, Methodology and Computing in Applied Probability, 8 (2006), 517-540. doi: 10.1007/s11009-006-0427-8.
    [30] T. Başar, A tutorial on dynamic and differential games, Dynamic games and applications in economics, (1986), 1-25. doi: 10.1007/978-3-642-61636-5_1.
    [31] H. Bauer and U. Rieder, Stochastic control problems with delay, Mathematical Methods of Operations Research, 62 (2005), 411-427.  doi: 10.1007/s00186-005-0042-4.
    [32] E. Bayraktar, A. Budhiraja and A. Cohen, A numerical scheme for a mean field game in some queueing systems based on Markov chain approximation method, SIAM Journal on Control and Optimization, 56 (2018), 4017-4044. doi: 10.1137/17M1154357.
    [33] E. Bayraktar, N. Bäuerle and A. D. Kara, Finite approximations and Q learning for mean field type multi agent control, arXiv: 2211.09633, 2022.
    [34] E. Bayraktar, Q. Feng and Z. Zhang, Deep signature algorithm for multidimensional path-dependent options, SIAM Journal on Financial Mathematics, 15 (2024), 194-214. doi: 10.1137/23M1571563.
    [35] R. W. Beard, G. N. Saridis and J. T. Wen, Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation, Automatica, 33 (1997), 2159-2177. doi: 10.1016/S0005-1098(97)00128-3.
    [36] C. Beck, W. E and A. Jentzen, Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations, Journal of Nonlinear Science, 29 (2019), 1563-1619. doi: 10.1007/s00332-018-9525-3.
    [37] S. Becker, P. Cheridito and A. Jentzen, Deep optimal stopping, Journal of Machine Learning Research, 20 (2019), 74.
    [38] S. Becker, P. Cheridito and A. Jentzen, Pricing and hedging American-style options with deep learning, Journal of Risk and Financial Management, 13 (2020), 158. doi: 10.3390/jrfm13070158.
    [39] R. Bellman, A Markovian decision process, Journal of Mathematics and Mechanics, (1957), 679-684. doi: 10.1512/iumj.1957.6.56038.
    [40] J.-D. Benamou and G. Carlier, Augmented Lagrangian methods for transport optimization, mean field games and degenerate elliptic equations, J. Optim. Theory Appl., 167 (2015), 1-26.  doi: 10.1007/s10957-015-0725-9.
    [41] C. Bender, N. Schweizer and J. Zhuo, A primal-dual algorithm for BSDEs, Mathematical Finance, 27 (2017), 866-901. doi: 10.1111/mafi.12100.
    [42] C. Bender and J. Steiner, Least-squares Monte Carlo for Backward SDEs, Springer, 2012. doi: 10.1007/978-3-642-25746-9_8.
    [43] C. Bender and J. Steiner, A posteriori estimates for backward SDEs, SIAM/ASA Journal on Uncertainty Quantification, 1 (2013), 139-163.  doi: 10.1137/120878689.
    [44] A. Bensoussan, Estimation and Control of Dynamical Systems, volume 48, Springer, 2018. doi: 10.1007/978-3-319-75456-7.
    [45] A. Bensoussan, J. Frehse and S. C. P. Yam, Mean Field Games and Mean Field Type Control Theory, Springer Briefs in Mathematics. Springer, New York, 2013. doi: 10.1007/978-1-4614-8508-7.
    [46] A. Bensoussan, J. Frehse and S. C. P. Yam, The master equation in mean field theory, J. Math. Pures Appl., 103 (2015), 1441-1474. doi: 10.1016/j.matpur.2014.11.005.
    [47] J. Berner, P. Grohs, G. Kutyniok and P. Petersen, The Modern Mathematics of Deep Learning, Cambridge University Press, 2022. doi: 10.1017/9781009025096.002.
    [48] D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The Discrete-Time Case, volume 5, Athena Scientific, 1996.
    [49] D. Bloembergen, K. Tuyls, D. Hennes and M. Kaisers, Evolutionary dynamics of multi-agent learning: A survey, Journal of Artificial Intelligence Research, 53 (2015), 659-697. doi: 10.1613/jair.4818.
    [50] H. Boedihardjo, X. Geng, T. Lyons and D. Yang, The signature of a rough path: uniqueness, Advances in Mathematics, 293 (2016), 720-737. doi: 10.1016/j.aim.2016.02.011.
    [51] J. F. Bonnans and H. Zidani, Consistency of generalized finite difference schemes for the stochastic HJB equation, SIAM Journal on Numerical Analysis, 41 (2003), 1008-1021. doi: 10.1137/S0036142901387336.
    [52] P. Bonnier, P. Kidger, I. P. Arribas, C. Salvi and T. Lyons, Deep signature transforms, In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
    [53] B. Bouchard, R. Elie and N. Touzi, Discrete-time approximation of BSDEs and probabilistic schemes for fully nonlinear PDEs, Advanced Financial Modelling, 8 (2009), 91-124. doi: 10.1515/9783110213140.91.
    [54] B. Bouchard and N. Touzi, Discrete-time approximation and Monte-Carlo simulation of backward stochastic differential equations, Stochastic Processes and Their Applications, 111 (2004), 175-206.  doi: 10.1016/j.spa.2004.01.001.
    [55] M. Boulbrachene and M. Haiour, The finite element approximation of Hamilton-Jacobi-Bellman equations, Computers & Mathematics with Applications, 41 (2001), 993-1007. doi: 10.1016/S0898-1221(00)00334-5.
    [56] A. Briani and P. Cardaliaguet, Stable solutions in potential mean field game systems, Nonlinear Differential Equations and Applications, 25 (2018), 1.  doi: 10.1007/s00030-017-0493-3.
    [57] L. M. Briceño Arias, D. Kalise, Z. Kobeissi, M. Laurière, A. Mateos González and F. J. Silva, On the implementation of a primal-dual algorithm for second order time-dependent mean field games with local couplings, ESAIM: ProcS, 65 (2019), 330-348. doi: 10.1051/proc/201965330.
    [58] L. M. Briceño Arias, D. Kalise and F. J. Silva, Proximal methods for stationary mean field games with local couplings, SIAM J. Control Optim., 56 (2018), 801-836. doi: 10.1137/16M1095615.
    [59] G. W. Brown, Some notes on computation of games solutions, Technical Report, Rand Corp Santa Monica CA, 1949.
    [60] G. W. Brown, Iterative solution of games by fictitious play, Activity Analysis of Production and Allocation, 13 (1951), 374-376. 
    [61] A. Budhiraja and K. Ross, Convergent numerical scheme for singular stochastic control with state constraints in a portfolio selection problem, SIAM Journal on Control and Optimization, 45 (2007), 2169-2206.  doi: 10.1137/050640515.
    [62] L. Busoniu, R. Babuska and B. De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38 (2008), 156-172. doi: 10.1109/TSMCC.2007.913919.
    [63] F. Camilli, S. Duisembay and Q. Tang, Approximation of an optimal control problem for the time-fractional Fokker-Planck equation, Journal of Dynamics & Games, 8 (2021). doi: 10.3934/jdg.2021013.
    [64] H. Cao, X. Guo, and M. Laurière, Connecting GANs, mean-field games, and optimal transport, To appear in SIAM Journal on Applied Mathematics, arXiv: 2002.04112, 2020.
    [65] P. Cardaliaguet, Notes on mean field games, 2013.
    [66] P. Cardaliaguet, F. Delarue, J.-M. Lasry and P.-L. Lions, The Master Equation and the Convergence Problem in Mean Field Games, Princeton University Press, 2019. doi: 10.23943/princeton/9780691190716.001.0001.
    [67] P. Cardaliaguet and S. Hadikhanloo, Learning in mean field games: the fictitious play, ESAIM: Control, Optimisation and Calculus of Variations, 23 (2017), 569-591. doi: 10.1051/cocv/2016004.
    [68] P. Cardaliaguet and C.-A. Lehalle, Mean field game of controls and an application to trade crowding, Mathematics and Financial Economics, 12 (2018), 335-363.  doi: 10.1007/s11579-017-0206-z.
    [69] P. Cardaliaguet and C. Rainer, On the (in) efficiency of MFG equilibria, SIAM Journal on Control and Optimization, 57 (2019), 2292-2314.  doi: 10.1137/18M1172363.
    [70] E. Carlini and F. J. Silva, A fully discrete semi-Lagrangian scheme for a first order mean field game problem, SIAM Journal on Numerical Analysis, 52 (2014), 45-67.  doi: 10.1137/120902987.
    [71] E. Carlini and F. J. Silva, A semi-Lagrangian scheme for a degenerate second order mean field game system, Discrete Contin. Dyn. Syst., 35 (2015), 4269-4292.  doi: 10.3934/dcds.2015.35.4269.
    [72] E. Carlini and F. J. Silva, On the discretization of some nonlinear Fokker–Planck–Kolmogorov equations and applications, SIAM Journal on Numerical Analysis, 56 (2018), 2148-2177.  doi: 10.1137/17M1143022.
    [73] R. Carmona and F. Delarue, Probabilistic analysis of mean-field games, SIAM Journal on Control and Optimization, 51 (2013), 2705-2734.  doi: 10.1137/120883499.
    [74] R. Carmona, J.-P. Fouque and L.-H. Sun, Mean field games and systemic risk, Communications in Mathematical Sciences, 13 (2015), 911-933. doi: 10.4310/CMS.2015.v13.n4.a4.
    [75] R. Carmona, Lectures on BSDEs, Stochastic Control, and Stochastic Differential Games with Financial Applications, SIAM, 2016. doi: 10.1137/1.9781611974249.
    [76] R. Carmona, F. Delarue and A. Lachapelle, Control of McKean-Vlasov dynamics versus mean field games, Math. Financ. Econ., 7 (2013), 131-166. doi: 10.1007/s11579-012-0089-y.
    [77] R. Carmona and F. Delarue, The master equation for large population equilibriums, In Stochastic Analysis and Applications, Springer, (2014), 77-128. doi: 10.1007/978-3-319-11292-3_4.
    [78] R. Carmona and F. Delarue, Probabilistic Theory of Mean Field Games with Applications I, Springer, 2018. doi: 10.1007/978-3-319-56436-4.
    [79] R. Carmona and F. Delarue, Probabilistic Theory of Mean Field Games with Applications II, Springer, 2018. doi: 10.1007/978-3-319-56436-4.
    [80] R. Carmona, C. V. Graves, and Z. Tan, Price of anarchy for mean field games, In CEMRACS 2017 Numerical Methods for Stochastic Models: Control, Uncertainty Quantification, Mean-Field, volume 65 of ESAIM Proc. Surveys, EDP Sci., Les Ulis, (2019), 349-383. doi: 10.1051/proc/201965349.
    [81] R. Carmona and D. Lacker, A probabilistic weak formulation of mean field games and applications, Ann. Appl. Probab., 25 (2015), 1189-1231.  doi: 10.1214/14-AAP1020.
    [82] R. Carmona and M. Laurière, Convergence Analysis of Machine Learning Algorithms for the Numerical Solution of Mean Field Control and Games I: The Ergodic Case, SIAM Journal on Numerical Analysis, 59 (2021), 1455-1485.  doi: 10.1137/19M1274377.
    [83] R. Carmona and M. Laurière, Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II the finite horizon case, The Annals of Applied Probability, 32 (2022), 4065-4105.  doi: 10.1214/21-AAP1715.
    [84] R. Carmona and M. Laurière, Deep learning for mean field games and mean field control with applications to finance, Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices, (2023), page 369. doi: 10.1017/9781009028943.021.
    [85] R. Carmona, M. Laurière and Z. Tan, Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods, arXiv: 1910.04295, 2019.
    [86] R. Carmona, M. Laurière and Z. Tan, Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning, The Annals of Applied Probability, 33 (2023), 5334-5381. doi: 10.1214/23-AAP1949.
    [87] R. Carmona and L. Leal, Optimal Execution with Quadratic Variation Inventories, Technical Report, Princeton University, 2021. doi: 10.2139/ssrn.3836898.
    [88] R. Carmona and K. Webster, The self-financing equation in high frequency markets, Finance & Stochastics, 23 (2019), 729-759.  doi: 10.1007/s00780-019-00398-z.
    [89] A. Cartea and S. Jaimungal, Incorporating order-flow into optimal execution, Math. Financ. Econ., 10 (2016), 339-364.  doi: 10.1007/s11579-016-0162-z.
    [90] P. Casgrain, B. Ning and S. Jaimungal, Deep Q-learning for Nash equilibria: Nash-DQN, Applied Mathematical Finance, 29 (2022), 62-78. doi: 10.1080/1350486X.2022.2136727.
    [91] S. Cen, C. Cheng, Y. Chen, Y. Wei and Y. Chi, Fast global convergence of natural policy gradient methods with entropy regularization, Operations Research, 70 (2022), 2563-2578. doi: 10.1287/opre.2021.2151.
    [92] Q. Chan-Wai-Nam, J. Mikael and X. Warin, Machine learning for semi linear PDEs, Journal of Scientific Computing, 79 (2019), 1667-1712. doi: 10.1007/s10915-019-00908-3.
    [93] J.-F. Chassagneux, D. Crisan and F. Delarue, A probabilistic approach to classical solutions of the master equation for large population equilibria, Forthcoming in Memoirs of the AMS, arXiv: 1411.3009, 2014.
    [94] J.-F. Chassagneux, Linear multistep schemes for BSDEs, SIAM Journal on Numerical Analysis, 52 (2014), 2815-2836. doi: 10.1137/120902951.
    [95] J.-F. Chassagneux, J. Chen and N. Frikha, Deep Runge-Kutta schemes for BSDEs, preprint, arXiv: 2212.14372, 2022.
    [96] J.-F. Chassagneux, D. Crisan and F. Delarue, Numerical method for FBSDEs of McKean-Vlasov type, The Annals of Applied Probability, 29 (2019), 1640-1684.
    [97] T. Chen, Z. O. Wang, I. Exarchos and E. Theodorou, Large-scale multi-agent deep FBSDEs, In International Conference on Machine Learning, (2021), 1740-1748.
    [98] Z. Chen and P. A. Forsyth, A semi-Lagrangian approach for natural gas storage valuation and optimal operation, SIAM Journal on Scientific Computing, 30 (2008), 339-368.  doi: 10.1137/060672911.
    [99] P. Cheridito, H. M. Soner, N. Touzi and N. Victoir, Second-order backward stochastic differential equations and fully nonlinear parabolic PDEs, Communications on Pure and Applied Mathematics, 60 (2007), 1081-1110.
    [100] J. Chessari, R. Kawai, Y. Shinozaki and T. Yamada, Numerical methods for backward stochastic differential equations: A survey, Probability Surveys, 20 (2023), 486-567.
    [101] D. Chevance, Numerical methods for backward stochastic differential equations, Numerical Methods in Finance, 232 (1997).
    [102] K. Cho, B. van Merriënboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (2014), 1724-1734.
    [103] R. Cont and W. Xiong, Dynamics of market making algorithms in dealer markets: Learning and tacit collusion, Mathematical Finance, 2022.
    [104] K. Cui and H. Koeppl, Approximately solving mean field games via entropy-regularized deep reinforcement learning, In Proc. of AISTATS, 2021.
    [105] J. Cvitanić, D. Possamaï and N. Touzi, Dynamic programming approach to principal-agent problems, Finance Stoch., 22 (2018), 1-37. doi: 10.1007/s00780-017-0344-4.
    [106] J. Cvitanić and J. Zhang, Contract Theory in Continuous-Time Models, Springer Finance. Springer, Heidelberg, 2013.
    [107] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, 2 (1989), 303-314. 
    [108] C. Daskalakis, P. W. Goldberg and C. H. Papadimitriou, The complexity of computing a Nash equilibrium, SIAM Journal on Computing, 39 (2009), 195-259. doi: 10.1137/070699652.
    [109] A. Davey and H. Zheng, Deep learning for constrained utility maximisation, Methodology and Computing in Applied Probability, 24 (2022), 661-692. 
    [110] T. De Ryck and S. Mishra, Error analysis for physics-informed neural networks (PINNs) approximating Kolmogorov PDEs, Advances in Computational Mathematics, 48 (2022), 1-40. 
    [111] T. De Ryck, A. D. Jagtap and S. Mishra, Error estimates for physics-informed neural networks approximating the Navier-Stokes equations, IMA Journal of Numerical Analysis, 44 (2024), 83-119. doi: 10.1093/imanum/drac085.
    [112] T. De Ryck and S. Mishra, Generic bounds on the approximation error for physics-informed (and) operator learning, Advances in Neural Information Processing Systems, 35 (2022), 10945-10958. 
    [113] K. Debrabant and E. R. Jakobsen, Semi-Lagrangian schemes for linear and fully non-linear diffusion equations, Math. Comput., 82 (2013), 1433-1462. doi: 10.1090/S0025-5718-2012-02632-9.
    [114] T. Degris, M. White and R. S. Sutton, Off-policy actor-critic, In Proceedings of the 29th International Coference on International Conference on Machine Learning, (2012), 179-186.
    [115] F. Delarue and A. Vasileiadis, Exploration noise for learning linear-quadratic mean field games, preprint, arXiv: 2107.00839, 2021.
    [116] V. Duarte, D. Duarte and D. Silva, Machine learning for continuous-time finance, 2024.
    [117] W. E and B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018), 1-12. 
    [118] W. E, J. Han and A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat., 5 (2017), 349-380. doi: 10.1007/s40304-017-0117-6.
    [119] W. E, M. Hutzenthaler, A. Jentzen and T. Kruse, On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations, Journal of Scientific Computing, 79 (2019), 1534-1571. doi: 10.1007/s10915-018-00903-0.
    [120] R. Elie, J. Pérolat, M. Laurière, M. Geist and O. Pietquin, On the convergence of model free learning in mean field games, Proceedings of the AAAI Conference on Artificial Intelligence, 34 (2020), 7143-7150. doi: 10.1609/aaai.v34i05.6203.
    [121] I. Elsanosi and B. Larssen, Optimal consumption under partial observations for a stochastic system with delay, Preprint Series, Pure Mathematics, http://urn.nb.no/URN:NBN:no-8076, 2001.
    [122] I. Exarchos, E. Theodorou and P. Tsiotras, Stochastic differential games: A sampling approach via FBSDEs, Dynamic Games and Applications, 9 (2019), 486-505. doi: 10.1007/s13235-018-0268-4.
    [123] M. Falcone, Numerical methods for differential games based on partial differential equations, International Game Theory Review, 8 (2006), 231-272. 
    [124] M. Falcone and R. Ferretti, Numerical methods for Hamilton-Jacobi type equations, Handbook of Numerical Analysis, 17 (2016), 603-626.
    [125] A.-M. Farahmand, C. Szepesvári and R. Munos, Error propagation for approximate policy and value iteration, Advances in Neural Information Processing Systems, 23 (2010).
    [126] M. Fazel, R. Ge, S. Kakade and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, In International Conference on Machine Learning, (2018), 1467-1476.
    [127] S. Federico, A stochastic control problem with delay arising in a pension fund model, Finance and Stochastics, 15 (2011), 421-459.  doi: 10.1007/s00780-010-0146-4.
    [128] X. Feng, R. Glowinski and M. Neilan, Recent developments in numerical methods for fully nonlinear second order partial differential equations, SIAM Review, 55 (2013), 205-267. doi: 10.1137/110825960.
    [129] D. Firoozi and S. Jaimungal, Exploratory LQG mean field games with entropy regularization, Automatica, 139 (2022), 110177.  doi: 10.1016/j.automatica.2022.110177.
    [130] P. A. Forsyth and G. Labahn, Numerical methods for controlled Hamilton-Jacobi-Bellman PDEs in finance, Journal of Computational Finance, 11 (2007).
    [131] J.-P. Fouque and Z. Zhang, Deep learning methods for mean field control problems with delay, Frontiers in Applied Mathematics and Statistics, 6 (2020).
    [132] C. Gao, S. Gao, R. Hu and Z. Zhu, Convergence of the backward deep BSDE method with applications to optimal stopping problems, SIAM Journal on Financial Mathematics, 14 (2023), 1290-1303. doi: 10.1137/22M1539952.
    [133] N. Gast, B. Gaujal and J.-Y. Le Boudec, Mean field for Markov decision processes: From discrete to continuous optimization, IEEE Transactions on Automatic Control, 57 (2012), 2266-2280. doi: 10.1109/TAC.2012.2186176.
    [134] M. Germain, M. Laurière, H. Pham and X. Warin, DeepSets and their derivative networks for solving symmetric PDEs, Journal of Scientific Computing, 91 (2022), 1-33.
    [135] M. Germain, J. Mikael and X. Warin, Numerical resolution of McKean-Vlasov FBSDEs using neural networks, Methodology and Computing in Applied Probability, (2022), 1-30.
    [136] M. Germain, H. Pham and X. Warin, Neural networks-based algorithms for stochastic control and PDEs in finance, Machine Learning And Data Sciences For Financial Markets: A Guide To Contemporary Practices, 2023.
    [137] F. A. Gers, N. N. Schraudolph and J. Schmidhuber, Learning precise timing with LSTM recurrent networks, Journal of Machine Learning Research, 3 (2002), 115-143.
    [138] E. Gobet, J.-P. Lemor and X. Warin, A regression-based Monte Carlo method to solve backward stochastic differential equations, Annals of Applied Probability, (2005), 2172-2202.
    [139] E. Gobet and R. Munos, Sensitivity analysis using Itô-Malliavin calculus and martingales, and application to stochastic optimal control, SIAM J. Control Optim., 43 (2005), 1676-1713.  doi: 10.1137/S0363012902419059.
    [140] D. A. Gomes, S. Patrizi and V. Voskanyan, On the existence of classical solutions for stationary extended mean field games, Nonlinear Analysis: Theory, Methods & Applications, 99 (2014), 49-79.
    [141] D. A. Gomes and V. K. Voskanyan, Extended deterministic mean-field games, SIAM Journal on Control and Optimization, 54 (2016), 1030-1055. 
    [142] D. Gomes, J. Gutierrez and M. Laurière, Machine learning architectures for price formation models, Applied Mathematics and Optimization, 88 (2023).
    [143] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
    [144] F. Gozzi, S. di Roma and C. Marinelli, Stochastic optimal control of delay equations arising in advertising models, Stochastic Partial Differential Equations and Applications-VII, (2005), 133-148.
    [145] F. Gozzi, C. Marinelli and S. Savin, On controlled linear diffusions with delay in a model of optimal advertising under uncertainty with memory effects, Journal of Optimization Theory and Applications, 142 (2009), 291-321. doi: 10.1007/s00220-009-0877-2.
    [146] P. J. Graber, Linear quadratic mean field type control and mean field games with common noise, with application to production of an exhaustible resource, Applied Mathematics and Optimization, 74 (2016), 459-486.  doi: 10.1007/s00245-016-9385-x.
    [147] A. Graves, Generating sequences with recurrent neural networks, preprint, arXiv: 1308.0850, 2013.
    [148] A. Graves, A.-r. Mohamed and G. Hinton, Speech recognition with deep recurrent neural networks, In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, (2013), 6645-6649.
    [149] A. Graves and J. Schmidhuber, Offline handwriting recognition with multidimensional recurrent neural networks, In Advances in Neural Information Processing Systems, (2009), 545-552.
    [150] P. Grohs, S. Ibragimov, A. Jentzen and S. Koppensteiner, Lower bounds for artificial neural network approximations: A proof that shallow neural networks fail to overcome the curse of dimensionality, Journal of Complexity, 77 (2023), 101746. doi: 10.1016/j.jco.2023.101746.
    [151] S. Gronauer and K. Diepold, Multi-agent deep reinforcement learning: A survey, Artificial Intelligence Review, (2021), 1-49.
    [152] H. Gu, X. Guo, X. Wei and R. Xu, Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis, SIAM Journal on Mathematics of Data Science, 3 (2021), 1168-1196.
    [153] H. Gu, X. Guo, X. Wei and R. Xu, Dynamic programming principles for mean-field controls with learning, Operations Research, 71 (2023), 1040-1054. doi: 10.1287/opre.2022.2395.
    [154] H. Gu, X. Guo, X. Wei and R. Xu, Mean-field multiagent reinforcement learning: A decentralized network approach, Mathematics of Operations Research, 2024.
    [155] X. Guo, A. Hu, R. Xu and J. Zhang, Learning mean-field games, In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
    [156] X. Guo, R. Xu and T. Zariphopoulou, Entropy regularization for mean field games with learning, Mathematics of Operations Research, 2022.
    [157] S. Hadikhanloo, Ph.D. thesis, Université Paris-Dauphine, 2018.
    [158] S. Hadikhanloo and F. J. Silva, Finite mean field games: Fictitious play and convergence to a first order continuous mean field game, Journal de Mathématiques Pures et Appliquées, 132 (2019), 369-397.
    [159] B. Hambly, R. Xu and H. Yang, Policy gradient methods for the noisy linear quadratic regulator over a finite horizon, SIAM Journal on Control and Optimization, 59 (2021), 3359-3391.
    [160] B. Hambly, R. Xu and H. Yang, Recent advances in reinforcement learning in finance, Mathematical Finance, 33 (2023), 437-503. doi: 10.1111/mafi.12382.
    [161] J. Han, R. Hu and J. Long, Convergence of deep fictitious play for stochastic differential games, Frontiers of Mathematical Finance, 1 (2022), 287-319. doi: 10.3934/fmf.2021011.
    [162] J. Han, R. Hu and J. Long, A class of dimensionality-free metrics for the convergence of empirical measures, Stochastic Processes and their Applications, 164 (2023), 242-287.
    [163] J. Han, R. Hu and J. Long, Learning high-dimensional McKean-Vlasov forward-backward stochastic differential equations with general distribution dependence, SIAM Journal on Numerical Analysis, accepted, 2023.
    [164] J. Han and W. E, Deep learning approximation for stochastic control problems, Deep Reinforcement Learning Workshop, NIPS, arXiv preprint arXiv: 1611.07422, 2016.
    [165] J. Han and R. Hu, Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games, Mathematical and Scientific Machine Learning (MSML), 107 (2020), 221-245. 
    [166] J. Han and R. Hu, Recurrent neural networks for stochastic control problems with delay, Mathematics of Control, Signals, and Systems, 33 (2021), 775-795. 
    [167] J. Han, A. Jentzen and W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), 8505-8510.
    [168] J. Han and J. Long, Convergence of the deep BSDE method for coupled FBSDEs, Probability, Uncertainty and Quantitative Risk, 5 (2020), 1-33. 
    [169] J. Han, Y. Yang and W. E, DeepHAM: A global solution method for heterogeneous agent models with aggregate shocks, preprint, arXiv: 2112.14377, 2021.
    [170] B. Hanin, Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics, 7 (2019), 992. doi: 10.3390/math7100992.
    [171] B. Hanin and D. Rolnick, Deep ReLU networks have surprisingly few activation patterns, Advances in Neural Information Processing Systems, 32, 2019.
    [172] B. Hanin and M. Sellke, Approximating continuous functions by ReLU nets of minimal width, preprint, arXiv: 1710.11278, 2017.
    [173] P. Henry-Labordere, Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM, Available at SSRN 3071506, 2017.
    [174] P. Henry-LabordereC. Litterer and Z. Ren, A dual algorithm for stochastic control problems: Applications to uncertain volatility models and CVA, SIAM Journal on Financial Mathematics, 7 (2016), 159-182.  doi: 10.1137/15M1019945.
    [175] C. F. Higham and D. J. Higham, Deep learning: An introduction for applied mathematicians, SIAM Review, 61 (2019), 860-891.  doi: 10.1137/18M1165748.
    [176] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 9 (1997), 1735-1780. doi: 10.1162/neco.1997.9.8.1735.
    [177] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, 4 (1991), 251-257.  doi: 10.1016/0893-6080(91)90009-T.
    [178] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), 359-366. doi: 10.1016/0893-6080(89)90020-8.
    [179] J. Hu and M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research, 4 (2003), 1039-1069. 
    [180] R. Hu and M. Ludkovski, Sequential design for ranking response surfaces, SIAM/ASA Journal on Uncertainty Quantification, 5 (2017), 212-239.  doi: 10.1137/15M1045168.
    [181] R. Hu, Deep learning for ranking response surfaces with applications to optimal stopping problems, Quantitative Finance, 20 (2020), 1567-1581.  doi: 10.1080/14697688.2020.1741669.
    [182] R. Hu, Deep fictitious play for stochastic differential games, Communications in Mathematical Sciences, 19 (2021), 325-353.  doi: 10.4310/CMS.2021.v19.n2.a2.
    [183] R. Hu and T. Zariphopoulou, $N$-player and mean-field games in Itô-diffusion markets with competitive or homophilous interaction, Stochastic Analysis, Filtering, and Stochastic Optimization: A Commemorative Volume to Honor Mark HA Davis's Contributions, (2022), 209-237. doi: 10.1007/978-3-030-98519-6_9.
    [184] M. Huang, P. E. Caines and R. P. Malhamé, Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized $\epsilon$-Nash equilibria, IEEE Transactions on Automatic Control, 52 (2007), 1560-1571. doi: 10.1109/TAC.2007.904450.
    [185] M. Huang, R. P. Malhamé and P. E. Caines, Large population stochastic dynamic games: Closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle, Communications in Information and Systems, 6 (2006), 221-252. doi: 10.4310/CIS.2006.v6.n2.a2.
    [186] K. J. Hunt, D. Sbarbaro, R. Żbikowski and P. J. Gawthrop, Neural networks for control systems–A survey, Automatica J. IFAC, 28 (1992), 1083-1112. doi: 10.1016/0005-1098(92)90053-I.
    [187] C. Huré, H. Pham, A. Bachouch and N. Langrené, Deep neural networks algorithms for stochastic control problems on finite horizon: Convergence analysis, SIAM Journal on Numerical Analysis, 59 (2021), 525-557. doi: 10.1137/20M1316640.
    [188] C. Huré, H. Pham and X. Warin, Deep backward schemes for high-dimensional nonlinear PDEs, Mathematics of Computation, 89 (2020), 1547-1579. doi: 10.1090/mcom/3514.
    [189] M. Hutzenthaler, A. Jentzen, T. Kruse and T. A. Nguyen, A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations, SN Partial Differential Equations and Applications, 1 (2020), 1-34.
    [190] R. Isaacs, Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization, London: John Wiley and Sons, 1965.
    [191] S. Jaimungal, S. M. Pesenti, Y. S. Wang and H. Tatsat, Robust risk-aware reinforcement learning, SIAM Journal on Financial Mathematics, 13 (2022), 213-226. doi: 10.1137/21M144640X.
    [192] M. Jensen and I. Smears, On the convergence of finite element methods for Hamilton–Jacobi–Bellman equations, SIAM Journal on Numerical Analysis, 51 (2013), 137-162.  doi: 10.1137/110856198.
    [193] A. Jentzen, D. Salimova and T. Welti, A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients, Communications in Mathematical Sciences, 19 (2021), 1167-1205. doi: 10.4310/CMS.2021.v19.n5.a1.
    [194] S. Ji, S. Peng, Y. Peng and X. Zhang, Three algorithms for solving high-dimensional fully coupled FBSDEs through deep learning, IEEE Intelligent Systems, 35 (2020), 71-84. doi: 10.1109/MIS.2020.2971597.
    [195] Y. Jia and X. Y. Zhou, Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach, Journal of Machine Learning Research, 23 (2022), 1-55.  doi: 10.2139/ssrn.3905379.
    [196] Y. Jia and X. Y. Zhou, Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms, Journal of Machine Learning Research, 23 (2022), 1-50.  doi: 10.2139/ssrn.3969101.
    [197] Z. Jin, M. Qiu, K. Q. Tran and G. Yin, A survey of numerical solutions for stochastic control problems: Some recent progress, Numerical Algebra, Control and Optimization, 12 (2022), 213-253. doi: 10.3934/naco.2022004.
    [198] L. P. Kaelbling, M. L. Littman and A. W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research, 4 (1996), 237-285. doi: 10.1613/jair.301.
    [199] N. Keriven, A. Bietti and S. Vaiter, On the universality of graph neural networks on large random graphs, Advances in Neural Information Processing Systems, 34 (2021), 6960-6971.
    [200] J. Kierzenka and L. F. Shampine, A BVP solver based on residual control and the MATLAB PSE, ACM Transactions on Mathematical Software (TOMS), 27 (2001), 299-316. doi: 10.1145/502800.502801.
    [201] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint, (2014), arXiv: 1412.6980.
    [202] A. C. Kizilkale and R. P. Malhame, Collective target tracking mean field control for Markovian jump-driven models of electric water heating loads, IFAC Proceedings Volumes, 47 (2014), 1867-1872.  doi: 10.3182/20140824-6-ZA-1003.00630.
    [203] Z. Kobeissi, On classical solutions to the mean field game system of controls, Communications in Partial Differential Equations, 47 (2022), 453-488.  doi: 10.1080/03605302.2021.1985518.
    [204] Z. Kobeissi and F. Bach, On a variance reduction correction of the temporal difference for policy evaluation in the stochastic continuous setting, arXiv preprint, (2022), arXiv: 2202.07960.
    [205] M. Kohler, A. Krzyżak and N. Todorovic, Pricing of high-dimensional American options by neural networks, Mathematical Finance, 20 (2010), 383-410. doi: 10.1111/j.1467-9965.2010.00404.x.
    [206] M. Kohlmann and X. Y. Zhou, Relationship between backward stochastic differential equations and stochastic controls: A linear-quadratic approach, SIAM J. Control Optim., 38 (2000), 1392-1407.  doi: 10.1137/S036301299834973X.
    [207] V. B. Kolmanovskiĭ and L. E. Shaĭkhet, Control of Systems with Aftereffect, Volume 157. American Mathematical Society, 1996.
    [208] V. N. Kolokoltsov and A. Bensoussan, Mean-field-game model for botnet defense in cyber-security, Appl. Math. Optim., 74 (2016), 669-692.  doi: 10.1007/s00245-016-9389-6.
    [209] H. Kushner and P. Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time, Applications of Mathematics (New York), 24. Stochastic Modelling and Applied Probability, Springer-Verlag, New York, 2001. doi: 10.1007/978-1-4613-0007-6.
    [210] H. J. Kushner, Numerical methods for stochastic control problems in continuous time, SIAM Journal on Control and Optimization, 28 (1990), 999-1048.  doi: 10.1137/0328056.
    [211] H. J. Kushner, Numerical approximations for stochastic differential games, SIAM Journal on Control and Optimization, 41 (2002), 457-486.  doi: 10.1137/S0363012901389457.
    [212] H. J. Kushner, Numerical approximations for stochastic differential games: The ergodic case, SIAM Journal on Control and Optimization, 42 (2004), 1911-1933.  doi: 10.1137/S0036301290140034.
    [213] H. J. Kushner, Numerical approximations for nonzero-sum stochastic differential games, SIAM Journal on Control and Optimization, 46 (2007), 1942-1971.  doi: 10.1137/050647931.
    [214] F. E. Kydland and E. C. Prescott, Time to build and aggregate fluctuations, Econometrica: Journal of the Econometric Society, (1982), 1345-1370. doi: 10.2307/1913386.
    [215] D. Lacker, Limit theory for controlled McKean–Vlasov dynamics, SIAM Journal on Control and Optimization, 55 (2017), 1641-1672.  doi: 10.1137/16M1095895.
    [216] D. Lacker and A. Soret, Many-player games of optimal consumption and investment under relative performance criteria, Mathematics and Financial Economics, 14 (2020), 263-281.  doi: 10.1007/s11579-019-00255-9.
    [217] D. Lacker and T. Zariphopoulou, Mean field and n-agent games for optimal investment under relative performance criteria, Mathematical Finance, 29 (2019), 1003-1038.  doi: 10.1111/mafi.12206.
    [218] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver and T. Graepel, A unified game-theoretic approach to multiagent reinforcement learning, Advances in Neural Information Processing Systems, 30 (2017).
    [219] B. Lapeyre and J. Lelong, Neural network regression for Bermudan option pricing, Monte Carlo Methods and Applications, 27 (2021), 227-247. doi: 10.1515/mcma-2021-2091.
    [220] J.-M. Lasry and P.-L. Lions, Jeux à champ moyen. I. Le cas stationnaire, C. R. Math. Acad. Sci. Paris, 343 (2006), 619-625. doi: 10.1016/j.crma.2006.09.019.
    [221] J.-M. Lasry and P.-L. Lions, Jeux à champ moyen. II. Horizon fini et contrôle optimal, C. R. Math. Acad. Sci. Paris, 343 (2006), 679-684. doi: 10.1016/j.crma.2006.09.018.
    [222] J.-M. Lasry and P.-L. Lions, Mean field games, Japanese Journal of Mathematics, 2 (2007), 229-260.  doi: 10.1007/s11537-007-0657-8.
    [223] M. Laurière, Numerical methods for mean field games and mean field type control, Mean Field Games, Sympos. Appl. Math., Amer. Math. Soc., Providence, RI, 78 (2021), 221-282. doi: 10.1090/psapm/078/06.
    [224] M. Laurière, S. Perrin, S. Girgin, P. Muller, A. Jain, T. Cabannes, G. Piliouras, J. Pérolat, R. Elie, O. Pietquin, et al., Scalable deep reinforcement learning algorithms for mean field games, International Conference on Machine Learning, (2022), 12078-12095.
    [225] M. Laurière, S. Perrin, J. Pérolat, S. Girgin, P. Muller, R. Élie, M. Geist and O. Pietquin, Learning in mean field games: A survey, arXiv preprint, (2022), arXiv: 2205.12944.
    [226] M. Laurière and O. Pironneau, Dynamic programming for mean-field type control, C. R. Math. Acad. Sci. Paris, 352 (2014), 707-713.  doi: 10.1016/j.crma.2014.07.008.
    [227] M. Laurière and L. Tangpi, Convergence of large population games to mean field games with interaction through the controls, SIAM Journal on Mathematical Analysis, 54 (2022), 3535-3574. doi: 10.1137/22M1469328.
    [228] L. Leal, M. Laurière and C.-A. Lehalle, Learning a functional control for high-frequency finance, Quantitative Finance, 22 (2022), 1973-1987. doi: 10.1080/14697688.2022.2106885.
    [229] W. Lefebvre and E. Miller, Linear-quadratic stochastic delayed control and deep learning resolution, Journal of Optimization Theory and Applications, 191 (2021), 134-168.  doi: 10.1007/s10957-021-01923-x.
    [230] C.-A. Lehalle and R. Azencott, Piecewise affine neural networks and nonlinear control, International Conference on Artificial Neural Networks, (1998), 633-638. doi: 10.1007/978-1-4471-1599-1_96.
    [231] M. Leshno, V. Y. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6 (1993), 861-867. doi: 10.1016/S0893-6080(05)80131-5.
    [232] K. Li and J. Liu, Portfolio selection under time delays: A piecewise dynamic programming approach, Available at SSRN 2916481, (2018).
    [233] Y. Li, Deep reinforcement learning: An overview, arXiv preprint, (2017), arXiv: 1701.07274.
    [234] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, Continuous control with deep reinforcement learning, Proceedings of the International Conference on Learning Representations (ICLR 2016), (2016).
    [235] A. T. Lin, S. W. Fung, W. Li, L. Nurbekyan and S. J. Osher, Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games, Proceedings of the National Academy of Sciences, 118 (2021), e2024713118. doi: 10.1073/pnas.2024713118.
    [236] P.-L. Lions, Cours du Collège de France, https://www.college-de-france.fr/site/en-pierre-louis-lions/_course.htm, 2007-2011.
    [237] Z. Lu, H. Pu, F. Wang, Z. Hu and L. Wang, The expressive power of neural networks: A view from the width, Advances in Neural Information Processing Systems, 30 (2017).
    [238] J. Luo and H. Zheng, Deep neural network solution for finite state mean field game with error estimation, Dynamic Games and Applications, 13 (2023), 859-896.  doi: 10.1007/s13235-022-00477-5.
    [239] T. Lyons and Z. Qian, System Control and Rough Paths, Oxford University Press, 2002. doi: 10.1093/acprof:oso/9780198506485.001.0001.
    [240] T. J. Lyons, M. Caruana and T. Lévy, Differential Equations Driven by Rough Paths, Springer, 2007. doi: 10.1007/978-3-540-71285-5.
    [241] J. Ma, P. Protter, J. San Martín and S. Torres, Numerical method for backward stochastic differential equations, Annals of Applied Probability, (2002), 302-316. doi: 10.1214/aoap/1015961165.
    [242] J. L. Mathieu, S. Koch and D. S. Callaway, State estimation and control of electric loads to manage real-time energy imbalance, IEEE Transactions on Power Systems, 28 (2013), 430-440. doi: 10.1109/TPWRS.2012.2204074.
    [243] M. Min and R. Hu, Signatured deep fictitious play for mean field games with common noise, International Conference on Machine Learning (ICML), PMLR139 (2021), 7736-7747. 
    [244] M. Min, R. Hu and T. Ichiba, Directed chain generative adversarial networks, International Conference on Machine Learning (ICML), PMLR 202 (2023), 24812-24830.
    [245] S. Mishra and R. Molinaro, Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs, IMA Journal of Numerical Analysis, 42 (2022), 981-1022.  doi: 10.1093/imanum/drab032.
    [246] S. Mishra and R. Molinaro, Estimates on the generalization error of physics-informed neural networks for approximating PDEs, IMA Journal of Numerical Analysis, 43 (2023), 1-43.  doi: 10.1093/imanum/drab093.
    [247] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis, Human-level control through deep reinforcement learning, Nature, 518 (2015), 529-533.  doi: 10.1038/nature14236.
    [248] S.-E. A. Mohammed, Stochastic Functional Differential Equations, Volume 99. Pitman Advanced Publishing Program, 1984.
    [249] S.-E. A. Mohammed, Stochastic differential systems with memory: Theory, examples and applications, Stochastic Analysis and Related Topics VI, (1998), 1-77. doi: 10.1007/978-1-4612-2022-0_1.
    [250] M. Motte and H. Pham, Mean-field Markov decision processes with common noise and open-loop controls, The Annals of Applied Probability, 32 (2022), 1421-1458.  doi: 10.1214/21-AAP1713.
    [251] R. Munos, Policy gradient in continuous time, Journal of Machine Learning Research, 7 (2006), 771-791. 
    [252] J. Nash, Non-cooperative games, Annals of Mathematics, (1951), 286-295. doi: 10.2307/1969529.
    [253] M. Nutz and Y. Zhang, Conditional optimal stopping: A time-inconsistent optimization, Ann. Appl. Probab., 30 (2020), 1669-1692.  doi: 10.1214/19-AAP1540.
    [254] A. M. Oberman, Convergent difference schemes for degenerate elliptic and parabolic equations: Hamilton–Jacobi equations and free boundary problems, SIAM Journal on Numerical Analysis, 44 (2006), 879-895.  doi: 10.1137/S0036142903435235.
    [255] B. Øksendal and A. Sulem, A maximum principle for optimal control of stochastic systems with delay, with applications to finance, Optimal Control and Partial Differential Equations, IOS, Amsterdam, (2001), 64-79, http://urn.nb.no/URN:NBN:no-8076.
    [256] E. Pardoux and S. Peng, Adapted solution of a backward stochastic differential equation, Systems & Control Letters, 14 (1990), 55-61.  doi: 10.1016/0167-6911(90)90082-6.
    [257] S. Park, C. Yun, J. Lee and J. Shin, Minimum width for universal approximation, International Conference on Learning Representations, (2020).
    [258] S. Peng, Stochastic Hamilton–Jacobi–Bellman equations, SIAM Journal on Control and Optimization, 30 (1992), 284-304.  doi: 10.1137/0330018.
    [259] S. Perrin, M. Laurière, J. Pérolat, R. Élie, M. Geist and O. Pietquin, Generalization in mean field games by learning master policies, Proceedings of the AAAI Conference on Artificial Intelligence, 36 (2022), 9413-9421.  doi: 10.1609/aaai.v36i9.21173.
    [260] S. Perrin, M. Laurière, J. Pérolat, M. Geist, R. Élie and O. Pietquin, Mean field games flock! The reinforcement learning way, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, 8 (2021), 356-362. doi: 10.24963/ijcai.2021/50.
    [261] H. Pham, Continuous-Time Stochastic Control and Optimization with Financial Applications, Stochastic Modelling and Applied Probability, 61. Springer-Verlag, Berlin, 2009.
    [262] H. Pham, On some recent aspects of stochastic control and their applications, Probability Surveys, 2 (2005), 506-549.  doi: 10.1214/154957805100000195.
    [263] H. Pham, X. Warin and M. Germain, Neural networks-based backward scheme for fully nonlinear PDEs, SN Partial Differential Equations and Applications, 2 (2021), Paper No. 16, 24 pp. doi: 10.1007/s42985-020-00062-8.
    [264] H. Pham and X. Wei, Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics, SIAM J. Control Optim., 55 (2017), 1069-1101.  doi: 10.1137/16M1071390.
    [265] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica, 8 (1999), 143-195.  doi: 10.1017/S0962492900002919.
    [266] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Volume 703. John Wiley & Sons, 2007. doi: 10.1002/9780470182963.
    [267] D. Psaltis, A. Sideris and A. A. Yamamura, A multilayered neural network controller, IEEE Control Systems Magazine, 8 (1988), 17-21.  doi: 10.1109/37.1868.
    [268] M. Raissi, P. Perdikaris and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), 686-707.  doi: 10.1016/j.jcp.2018.10.045.
    [269] C. Reisinger, W. Stockinger and Y. Zhang, A fast iterative PDE-based algorithm for feedback controls of nonsmooth mean-field control problems, SIAM Journal on Scientific Computing, (2024).
    [270] C. Reisinger, W. Stockinger and Y. Zhang, A posteriori error estimates for fully coupled McKean-Vlasov forward-backward SDEs, IMA Journal of Numerical Analysis, (2023). doi: 10.1093/imanum/drad060.
    [271] A. M. Reppen, H. M. Soner and V. Tissot-Daguette, Neural optimal stopping boundary, arXiv preprint, (2022), arXiv: 2205.04595.
    [272] A. M. Reppen, H. M. Soner and V. Tissot-Daguette, Deep stochastic optimization in finance, Digital Finance, 5 (2023), 91-111.  doi: 10.1007/s42521-022-00074-6.
    [273] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning representations by back-propagating errors, Nature, 323 (1986), 533-536.  doi: 10.1038/323533a0.
    [274] L. Ruthotto, S. J. Osher, W. Li, L. Nurbekyan and S. W. Fung, A machine learning framework for solving high-dimensional mean field game and mean field control problems, Proceedings of the National Academy of Sciences, 117 (2020), 9183-9193.  doi: 10.1073/pnas.1922204117.
    [275] Y. F. Saporito and Z. Zhang, Path-dependent deep Galerkin method: A neural network approach to solve path-dependent partial differential equations, SIAM Journal on Financial Mathematics, 12 (2021), 912-940.  doi: 10.1137/20M1329597.
    [276] A. M. Schäfer and H. G. Zimmermann, Recurrent neural networks are universal approximators, In Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, (2006), Springer, 632-640.
    [277] B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner and M. Geist, Approximate modified policy iteration and its application to the game of Tetris, J. Mach. Learn. Res., 16 (2015), 1629-1676.
    [278] S. Shalev-Shwartz, S. Shammah and A. Shashua, Safe, multi-agent, reinforcement learning for autonomous driving, arXiv: 1610.03295, 2016.
    [279] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), 484-489.
    [280] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, Deterministic policy gradient algorithms, In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, (2014), 387-395.
    [281] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics, 375 (2018), 1339-1364.  doi: 10.1016/j.jcp.2018.08.029.
    [282] I. H. Sloan and H. Woźniakowski, When are quasi-Monte Carlo algorithms efficient for high dimensional integrals?, Journal of Complexity, 14 (1998), 1-33. 
    [283] J. Subramanian and A. Mahajan, Reinforcement learning in stationary mean-field games, In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '19, Richland, SC, (2019), 251-259.
    [284] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
    [285] A.-S. Sznitman, Topics in propagation of chaos, In Ecole d'été de probabilités de Saint-Flour XIX-1989, Springer, (1991), 165-251.
    [286] K. Tuyls and G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine, 33 (2012), 41-52.  doi: 10.1609/aimag.v33i3.2426.
    [287] R. van der Meer, C. W. Oosterlee and A. Borovykh, Optimally weighted loss functions for solving PDEs with neural networks, Journal of Computational and Applied Mathematics, 405 (2022), 113887.
    [288] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature, 575 (2019), 350-354.
    [289] H. Wang, T. Zariphopoulou and X. Y. Zhou, Reinforcement learning in continuous time and space: A stochastic control approach, Journal of Machine Learning Research, 21 (2020), 1-34.
    [290] H. Wang and X. Y. Zhou, Continuous-time mean–variance portfolio selection: A reinforcement learning framework, Mathematical Finance, 30 (2020), 1273-1308.  doi: 10.1111/mafi.12281.
    [291] L. Wang, Q. Cai, Z. Yang and Z. Wang, Neural policy gradient methods: Global optimality and rates of convergence, In International Conference on Learning Representations, 2019.
    [292] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Reinforcement Learning, (1992), 5-32.
    [293] Q. Xie, Z. Yang, Z. Wang and A. Minca, Learning while playing in mean-field games: Convergence and optimality, In International Conference on Machine Learning, (2021), 11436-11447.
    [294] Y. Xuan, R. Balkin, J. Han, R. Hu and H. D. Ceniceros, Optimal policies for a pandemic: A stochastic game approach and a deep learning algorithm, Mathematical and Scientific Machine Learning, 145 (2022), 987-1012.
    [295] Y. Xuan, R. Balkin, J. Han, R. Hu and H. D. Ceniceros, Pandemic control, game theory and machine learning, Notices of the AMS, 69 (2022), 1878-1887.
    [296] Y. Yang and J. Wang, An overview of multi-agent reinforcement learning from game theoretical perspective, arXiv: 2011.00583, 2020.
    [297] J. Yong, Differential Games: A Concise Introduction, World Scientific, 2014.
    [298] Y. Zang, J. Long, X. Zhang, W. Hu, W. E and J. Han, A machine learning enhanced algorithm for the optimal landing problem, In Mathematical and Scientific Machine Learning, (2022), 319-334.
    [299] J. Zhang, A numerical scheme for BSDEs, The Annals of Applied Probability, 14 (2004), 459-488. 
    [300] K. Zhang, Z. Yang and T. Başar, Multi-agent reinforcement learning: A selective overview of theories and algorithms, Handbook of Reinforcement Learning and Control, (2021), 321-384.
    [301] W. Zhao, L. Chen and S. Peng, A new kind of accurate numerical method for backward stochastic differential equations, SIAM Journal on Scientific Computing, 28 (2006), 1563-1581.  doi: 10.1137/05063341X.
    [302] W. Zhao, G. Zhang and L. Ju, A stable multistep scheme for solving backward stochastic differential equations, SIAM Journal on Numerical Analysis, 48 (2010), 1369-1394.  doi: 10.1137/09076979X.
    [303] D.-X. Zhou, Universality of deep convolutional neural networks, Applied and Computational Harmonic Analysis, 48 (2020), 787-794.
    [304] M. Zhou, J. Han and J. Lu, Actor-critic method for high dimensional static Hamilton–Jacobi–Bellman partial differential equations based on neural networks, SIAM Journal on Scientific Computing, 43 (2021), A4043-A4066. doi: 10.1137/21M1402303.
    [305] M. Zhou and J. Lu, A policy gradient framework for stochastic optimal control problems with global convergence guarantee, arXiv: 2302.05816, 2023.
    [306] M. Zhou and J. Lu, Solving time-continuous stochastic optimal control problems: Algorithm design and convergence analysis of actor-critic flow, arXiv: 2402.17208, 2024.