Article Contents
Article Contents

# Deep learning as optimal control problems: Models and numerical methods

• * Corresponding author: Carola-Bibiane Schönlieb
• We consider recent work of [18] and [9], where deep learning neural networks have been interpreted as discretisations of an optimal control problem subject to an ordinary differential equation constraint. We review the first order conditions for optimality, and the conditions ensuring optimality after discretisation. This leads to a class of algorithms for solving the discrete optimal control problem which guarantee that the corresponding discrete necessary conditions for optimality are fulfilled. The differential equation setting lends itself to learning additional parameters such as the time discretisation. We explore this extension alongside natural constraints (e.g. time steps lie in a simplex). We compare these deep learning algorithms numerically in terms of induced flow and generalisation ability.

Mathematics Subject Classification: Primary: 65L10, 65K10, 68Q32; Secondary: 49J15.

 Citation:

• Figure 1.  The four data sets used in the numerical study

Figure 2.  Learned transformation and classifier for data set donut1d (top) and squares (bottom)

Figure 3.  Snap shots of transformation of features for data set spiral

Figure 4.  Learned transformation with fixed classifier for data set donut1d (top) and spiral (bottom)

Figure 5.  Robustness on random initialisation for transformed data donut2d and linear classifier for two different initialisations

Figure 6.  Function values over the course of the gradient descent iterations for data sets donut1d, donut2d, spiral, squares (left to right and top to bottom). The solid line represents training and the dashed line test data

Figure 7.  Classification accuracy over the course of the gradient descent iterations for data setsdonut1d, donut2d, spiral, squares (left to right and top to bottom). The solid line represents training and the dashed line test data

Figure 8.  Estimated time steps by ResNet/Euler, ODENet and ODENet+simplex for for data sets donut1d, donut2d, spiral, squares (left to right and top to bottom). ODENet+simplex consistently picks two to three time steps and set the rest to zero

Figure 9.  Learned prediction and transformation for different Runge-Kutta methods and data sets spiral (top), donut2d (centre) and squares (bottom). All results are for 15 layers

Figure 10.  Snap shots of the transition from initial to final state through the network with the data set spiral

Figure 11.  Snap shots of the transition from initial to final state through the network with the data set donut2d

Figure 12.  Snap shots of the transition from initial to final state through the network with the data set squares

Figure 13.  Function values over the course of the gradient descent iterations for data sets donut1d, donut2d, spiral, squares (left to right and top to bottom)

Figure 14.  Accuracy (left) and time steps (right) for MNIST100 dataset [28]

Figure 15.  Features of testing examples from MNIST100 dataset [28] and transformed features by four networks under comparison: Net, ResNet, ODENet, ODENet+Simplex (from top to bottom). All networks have 20 layers

Table 1.  Four explicit Runge–Kutta methods: ResNet/Euler, Improved Euler, Kutta(3) and Kutta(4)

•  [1] H. D. Abarbanel, P. J. Rozdeba and S. Shirman, Machine learning: Deepest learning as statistical data assimilation problems, Neural Comput., 30 (2018), 2025-2055.  doi: 10.1162/neco_a_01094. [2] A. A. Agrachev and Y. Sachkov, Control Theory from the Geometric Viewpoint, Encyclopaedia of Mathematical Sciences, 87, Springer-Verlag, Berlin, 2004. doi: 10.1007/978-3-662-06404-7. [3] M. Benning and M. Burger, Modern regularization methods for inverse problems, Acta Numer., 27 (2018), 1-111.  doi: 10.1017/S0962492918000016. [4] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, New York, 2006. [5] S. Blanes and F. Casas, On the necessity of negative coefficients for operator splitting schemes of order higher than two, Appl. Numer. Math., 54 (2005), 23-37.  doi: 10.1016/j.apnum.2004.10.005. [6] R. Bucy, Two-point boundary value problems of linear Hamiltonian systems, SIAM J. Appl. Math., 15 (1967), 1385-1389.  doi: 10.1137/0115121. [7] M. Burger, G. Gilboa, S. Osher and J. Xu et al., Nonlinear inverse scale space methods, Commun. Math. Sci., 4 (2006), 179-212.  doi: 10.4310/CMS.2006.v4.n1.a7. [8] M. Burger, S. Osher, J. Xu and G. Gilboa, Nonlinear inverse scale space methods for image restoration, in International Workshop on Variational, Geometric, and Level Set Methods in Computer Vision, Lecture Notes in Computer Science, 3752, Springer-Verlag, Berlin, 2005, 25–36. doi: 10.1007/11567646_3. [9] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert and E. Holtham, Reversible architectures for arbitrarily deep residual neural networks, 32nd AAAI Conference on Artificial Intelligence, 2018. [10] T. Q. Chen, Y. Rubanova, J. Bettencourt and D. K. Duvenaud, Neural ordinary differential equations, in Advances in Neural Information Processing Systems, 2018, 6572–6583. [11] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems, 2014, 2933–2941. [12] A. L. Dontchev and W. W. Hager, The Euler approximation in state constrained optimal control, Math. Comp., 70 (2001), 173-203.  doi: 10.1090/S0025-5718-00-01184-4. [13] J. Duchi, S. Shalev-Shwartz, Y. Singer and T. Chandra, Efficient projections onto the $\ell$1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, 2008, 272–279. [14] W. E, A proposal on machine learning via dynamical systems, Commun. Math. Stat., 5 (2017), 1-11.  doi: 10.1007/s40304-017-0103-z. [15] F. Gay-Balmaz and T. S. Ratiu, Clebsch optimal control formulation in mechanics, J. Geom. Mech., 3 (2011), 41-79.  doi: 10.3934/jgm.2011.3.41. [16] A. Gholami, K. Keutzer and G. Biros, ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs, preprint, arXiv: math/1902.10298. [17] I. Goodfellow, J. Shlens and C. Szegedy, Explaining and harnessing adversarial examples, preprint, arXiv: math/1412.6572. [18] E. Haber and L. Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2018), 22pp.  doi: 10.1088/1361-6420/aa9a90. [19] W. W. Hager, Runge-Kutta methods in optimal control and the transformed adjoint system, Numer. Math., 87 (2000), 247-282.  doi: 10.1007/s002110000178. [20] E. Hairer, C. Lubich and G. Wanner, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, Springer Series in Computational Mathematics, 31, Springer-Verlag, Berlin, 2006. doi: 10.1007/3-540-30666-8. [21] K. He, X. Zhang, S. Ren and J. Sun, Identity mappings in deep residual networks, in European Conference on Computer Vision, Lecture Notes in Computer Science, 9908, Springer, Cham, 2016, 630–645. doi: 10.1007/978-3-319-46493-0_38. [22] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, 770–778. doi: 10.1109/CVPR.2016.90. [23] C. F. Higham and D. J. Higham, Deep learning: An introduction for applied mathematicians, preprint, arXiv: math/1801.05894. [24] M. Igami, Artificial intelligence as structural estimation: Economic interpretations of Deep Blue, Bonanza, and AlphaGo, preprint, arXiv: math/1710.10967. [25] A. Kurakin, I. Goodfellow and S. Bengio, Adversarial examples in the physical world, preprint, arXiv: math/1607.02533. [26] W. Kutta, Beitrag zur näherungsweisen integration totaler differentialgleichungen, Z. Math. Phys., 46 (1901), 435-453. [27] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature, 521 (2015), 436-444.  doi: 10.1038/nature14539. [28] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86, 1998, 2278–2324. doi: 10.1109/5.726791. [29] Y. LeCun, A theoretical framework for back-propagation, Proceedings of the 1988 Connectionist Models Summer School, 1, CMU, Pittsburgh, PA, 1988, 21–28. [30] Q. Li, L. Chen, C. Tai and E. Weinan, Maximum principle based algorithms for deep learning, J. Mach. Learn. Res., 18 (2017), 5998-6026. [31] Q. Li and S. Hao, An optimal control approach to deep learning and applications to discrete-weight neural networks, preprint, arXiv: math/1803.01299. doi: 10.1109/TNNLS.2017.2665555. [32] Y. Lu, A. Zhong, Q. Li and B. Dong, Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 80, Stockholm, Sweden, 2018, 3276–3285. Available at: http://proceedings.mlr.press/v80/lu18d.html. [33] S.-M. Moosavi-Dezfooli, A. Fawzi and P. Frossard, Deepfool: A simple and accurate method to fool deep neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 2574–2582. doi: 10.1109/CVPR.2016.282. [34] D. L. Phillips, A technique for the numerical solution of certain integral equations of the first kind, J. Assoc. Comput. Mach., 9 (1962), 84-97.  doi: 10.1145/321105.321114. [35] D. C. Plaut et al., Experiments on learning by back propagation, work in progress. [36] L. Pontryagin et. al, Selected Works: The Mathematical Theory of Optimal Processes, Classics of Soviet Mathematics, 4, Gordon & Breach Science Publishers, New York, 1986. [37] I. M. Ross, A roadmap for optimal control: The right way to commute, Ann. N.Y. Acad. Sci., 1065 (2005), 210-231.  doi: 10.1196/annals.1370.015. [38] F. Santosa and W. W. Symes, Linear inversion of band-limited reflection seismograms, SIAM J. Sci. Statist. Comput., 7 (1986), 1307-1330.  doi: 10.1137/0907087. [39] J. M. Sanz-Serna, Symplectic Runge-Kutta schemes for adjoint equations automatic differentiation, optimal control and more, SIAM Rev., 58 (2016), 3-33.  doi: 10.1137/151002769. [40] O. Scherzer and C. Groetsch, Inverse scale space theory for inverse problems, in International Conference on Scale-Space Theories in Computer Vision, Lecture Notes in Computer Science, 2106, Springer, Berlin, 2001, 317–325. doi: 10.1007/3-540-47778-0_29. [41] S. Sonoda and N. Murata, Double continuum limit of deep neural networks, ICML Workshop Principled Approaches to Deep Learning, 2017. [42] E. D. Sontag, Mathematical Control Theory: Deterministic Finite-Dimensional Systems, Texts in Applied Mathematics, 6, Springer-Verlag, New York, 1998. doi: 10.1007/978-1-4612-0577-7. [43] I. Sutskever, J. Martens, G. E. Dahl and G. E. Hinton, On the importance of initialization and momentum in deep learning, Proceedings of the 30th International Conference on Machine Learning, 28, 2013, 1139–1147. [44] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow and R. Fergus, Intriguing properties of neural networks, International Conference on Learning Representations, 2014. [45] M. Thorpe and Y. van Gennip, Deep limits of residual neural networks, preprint, arXiv: math/1810.11741. [46] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), 267-288.  doi: 10.1111/j.2517-6161.1996.tb02080.x. [47] A. N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, in Dokl. Akad. Nauk., 151, 1963, 1035–1038. [48] E. Weinan, J. Han and Q. Li, A mean-field optimal control formulation of deep learning, Res. Math. Sci., 6 (2019), 41pp.  doi: 10.1007/s40687-018-0172-y.

Figures(15)

Tables(1)