Deep learning as optimal control problems: models and numerical methods

We consider recent work of Haber and Ruthotto 2017 and Chang et al. 2018, where deep learning neural networks have been interpreted as discretisations of an optimal control problem subject to an ordinary differential equation constraint. We review the first order conditions for optimality, and the conditions ensuring optimality after discretization. This leads to a class of algorithms for solving the discrete optimal control problem which guarantee that the corresponding discrete necessary conditions for optimality are fulfilled. We discuss two different deep learning algorithms and make a preliminary analysis of the ability of the algorithms to generalise.


Introduction
Deep learning has had a transformative impact on a wide range of tasks related to Artificial Intelligence, ranging from computer vision and speech recognition to playing games [17,20].
Despite impressive results in applications, the mechanisms behind deep learning remain rather mysterious, resulting in deep neural networks mostly acting as black-box algorithms. Consequently, also theoretical guarantees for deep learning are scarce, with major open problems residing in the mathematical sciences. An example are questions around the stability of training as well as the design of stable architectures. These questions are fed by results on the possible instabilities of the training (due to the high-dimensional nature of the problem in combination with its non-convexity) [29,7] which are connected to the lack of generalizability of the learned architecture, and adversarial vulnerability of trained networks [30] that can result in instabilities in the solution and gives rise to systematic attacks with which networks can be fooled [10,18,23]. In this work we want to shed light on these issues by interpreting deep learning as an optimal control problem in the context of binary classification problems. Our work is mostly inspired by a very early paper by LeCun [21], and a series of recent works by Haber, Ruthotto et al. [11,6].
Binary classification in machine learning: Classification is a key task in machine learning; the goal is to learn functions, also known as classifiers, that map their input arguments onto a discrete set of labels that are associated with a particular 1 arXiv:1904.05657v1 [math.OC] 11 Apr 2019 class. A simple example is image classification, where the input arguments are images that depict certain objects, and the classifier aims to identify the class to which the object depicted in the image belongs to. We can model such a classifier as a function g : R n → {c 0 , c 1 , . . . , c K−1 } that takes n-dimensional real-valued vectors and maps them onto a discrete set of K class labels. Note that despite using numerical values, there is no particular ordering of the class labels. The special case of K = 2 classes (and class labels) is known as binary classification; for the remainder of this paper we strictly focus on this binary classification case.
In supervised machine learning, the key idea is to find a classifier by estimating optimal parameters of a parametric function given pairs of data samples {(x i , c i )} m i=1 , for c i ∈ {c 0 , c 1 }, and subsequently defining a suitable classifier that is parametrized with these parameters. The process of finding suitable parameters is usually formulated as a generalised regression problem, i.e. we estimate parameters u, W, µ by minimising a cost function of the form with respect to u, W and µ.
Here h is a model function parametrized by parameters u that transforms inputs x i ∈ R n onto n-dimensional outputs. The vector W ∈ R 1×n is a weight vector that weights this n-dimensional model output, whereas µ ∈ R is a scalar that allows a bias of the weighted model output, and C : R n → R is the so-called hypothesis-function (cf. [14,11]) that maps this weighted and biased model output to a scalar value that can be compared to the class label c i . The function R is a regularisation function that is chosen to ensure some form of regularity of the parameters u and existence of parameters that minimise (1). Note that minimising (1) yields in parameters that minimise the deviation of the output of the hypothesis function and the given labels. If we denote those parameters that minimise (1) byŴ ,μ andû, and if the hypothesis function C maps directly onto the discrete set {c 0 , c 1 }, then a suitable classifier can simply be defined via However, in practice the hypothesis function is often rather continuous and does not map directly on the discrete values {c 0 , c 1 }. In this scenario, a classifier can be defined by subsequent thresholding. Let c 0 and c 1 be real numbers and w.l.o.g. c 0 < c 1 , then a suitable classifier can for instance be defined via Deep learning as an optimal control problem: One recent proposal towards the design of deep neural network architectures is [11,6,31,22]. There, the authors propose an interpretation of deep learning by the popular Residual neural Network (ResNet) architecture [15] as discrete optimal control problems. Let where K [j] is a n × n matrix of weights, β [j] represents the biases, and N is the number of layers. In order to use ResNet for binary classification we can define the output of the model function h in (1)  n , this implies that the classification problem (1) can be written as a constraint minimisation problem of the form subject to the constraint Here ∆t is a parameter which for simplicity at this stage can be chosen to be equal to 1 and whose role will become clear in what follows. The constraint (3) is the ResNet parametrization of a neural network [15]. In contrast, the widely used feed-forward network that we will also investigate later is given by For deep learning algorithms, one often has where σ is a suitable activation function acting component-wise on its arguments. For a more extensive mathematical introduction to deep learning we recommend [16].
Suppose, in what follows, that y i = y i (t) and u = u(t) = (K(t), β(t)), t ∈ [0, T ], are functions of time and y [j] i ≈ y i (t j ). To view (2) and (3) as a discretization of an optimal control problem [6], one observes that the constraint equation (3) is the discretization of the ordinary differential equation (ODE)ẏ i = f (y i , u), y i (0) = x i , on [0, T ], with step-size ∆t and with the forward Euler method. In the continuum, the following optimal control problem is obtained [6], subject to the ODE constrainṫ Assuming that problem (6)-(7) satisfies necessary conditions for optimality [28, ch. 9], and that a suitable activation function and cost function have been chosen, a number of new deep learning algorithms can be generated. These new strategies are obtained by considering constraint ODEs (7) with different structural properties, e.g. taking f to be a Hamiltonian vector field, and by choosing accordingly the numerical integration methods to approximate (7), [6,11]. However, these different architectures need further study and more rigorous justification.
Our contribution: The main purpose of this paper is the investigation of different discretisations of the underlying continuous deep learning problem (6)- (7). In [11,6] the authors investigate different ODE-discretisations for the neural network (7), with a focus on deriving a neural network that describes a 'stable' flow, i.e. the solution y(T ) should be bounded by the initial condition y(0).
Our point of departure from the state-of-the-art will be to outline the well established theory on optimal control, numerical ODE problems based on [12,25,26], where we investigate the complete optimal control problem (6)-(7) under the assumption of necessary conditions for optimality [28, ch. 9].
The formulation of the deep learning problem (2)-(3) is a first-discretize-then optimise approach to optimal control, where ODE (7) is first discretized with a forward Euler method to yield an optimisation problem which is then solved with gradient descent (direct method). In this setting the forward Euler method could be easily replaced by a different and more accurate integration method, but the backpropagation for computing the gradients of the discretized objective function will typically become more complicated.
Here, we propose a first-optimise-then discretize approach for deriving new deep learning algorithms. There is a two-point boundary value Hamiltonian problem associated to (6)-(7) expressing first order optimality conditions of the optimal control problem [24]. This boundary value problem consists of (7), with y i (0) = x, together with its adjoint equation with boundary value at final time T , and in addition an algebraic constraint. In the first-optimise-then-discretize approach, this boundary value problem is solved by a numerical integration method. It is natural to solve equation (7) forward in time with a Runge-Kutta method (with non vanishing weights b i , i = 1, . . . , s), while the adjoint equation must be solved backward in time and with a matching Runge-Kutta method (with weights satisfying (19)) and imposing the constraints at each time step. If the combination of the forward integration method and its counterpart used backwards in time form a symplectic partitioned Runge-Kutta method then the overall discretization is equivalent to a first-discretize-then-optimize approach, but with an efficient and automatic computation of the gradients [12,26].
We implement discretization strategies based on different Runge-Kutta methods for (6)- (7). To make the various methods comparable to each other, we use the same learned parameters for every Runge-Kutta stage, in this way the total the number of parameters will not depend on how many stages each method has. The discretizations are adaptive in time, and learning the step-sizes the number of layers is determined automatically by the algorithms. From the optimal control formulation we derive different instances of deep learning algorithms (2)-(3) by numerical discretizations of the first-order optimality conditions using a partitioned Runge-Kutta method.
Outline of the paper: In Section 2 we derive the optimal control formulation of (6)-(7) and discuss its main properties. In particular, we derive the variational equation, the adjoint equation, the associated Hamiltonian and first-order optimality conditions. Different instances of the deep learning algorithm (2)-(3) are derived in Section 3 by using symplectic partitioned Runge-Kutta methods applied to the constraint equation (9), and the adjoint equations (12) and (13). Using partitioned Runge-Kutta discretization guarantees equivalence of the resulting optimality system to the one derived from a first-discretize-then-optimise approach using gradient descent, cf. Proposition 3.1. In Section 4 we derive several new deep learning algorithms from such optimal control discretisations, and investigate by numerical experiments their dynamics in Section 5 on a selection of toy problems for binary classification in two dimensions.

Properties of the optimal control problem
In this section we review established literature on optimal control which justifies the use of the numerical methods of the next section.

Variational equation.
In this section, we consider a slightly simplified formulation of (6)-(7). In particular, for simplicity we discard the term R(u) in (6), and remove the index "i" in (6)-(7) and the summation over the number of data points. Moreover, as we here focus on the ODE (7), we also remove the dependency on the classification parameters W and µ for now. We rewrite the optimal control problem in the simpler form min subject to the ODE constraintẏ Then, the variational equation for (8)-(9) reads where ∂ y f is the Jacobian of f with respect to y, ∂ u f is the Jacobian of f with respect to u, and v is the variation in y, while w is the variation in u 1 . Since y(0) = x is fixed, v(0) = 0.

Adjoint equation.
The adjoint of (10) is a system of ODEs for a variable p(t), obtained assuming Then (11) implies an integration-by-parts formula which together with (10) leads to the following equation for p: with constraint see [26]. Here we have denoted by (∂ y f ) T the transpose of ∂ y f with respect to the Euclidean inner product ·, · , and similarly (∂ u f ) T is the transpose of ∂ u f .

Associated Hamiltonian system.
For such an optimal control problem, there is an associated Hamiltonian system with Hamiltonian where we recognise that the first equationẏ = ∂ p H coincides with (9), the seconḋ p = −∂ y H with (12) and the third ∂ u H = 0 with (13). The constraint Hamiltonian system is a differential algebraic equation of index one if the Hessian ∂ u,u H is invertible. In this case, by the implicit function theorem there exists ϕ such that u = ϕ(y, p), andH(y, p) = H(y, p, ϕ(y, p)), where the differential algebraic Hamiltonian system is transformed into a canonical Hamiltonian system of ODEs with HamiltonianH. Notice that it is important to know that ϕ exists, but it is not necessary to compute ϕ explicitly for discretizing the problem.

2.4.
First order necessary conditions for optimality. The solution of the two point boundary value problem (9) and (12), (13) with y(0) = x and has the following property so the variation v(T ) is orthogonal to the gradient of the cost function ∂ y J (y)| y=y(T ) . This means that the solution (y(t), v(t), p(t)) satisfies the first order necessary conditions for extrema of J (Pontryagin maximum principle) [24], see also [28, ch. 9 3. Numerical discretization of the optimal control problem We consider a time discrete setting y [0] , y [1] , . . . , y [N ] , u [0] , . . . , u [N −1] and a cost function J (y [N ] ), assuming to apply a numerical time-discretization y (9), the discrete optimal control problem becomes Here the subscript ∆t denotes the discretization step-size of the time interval [0, T ]. This discrete optimal control problem corresponds to a deep learning algorithm with the outlined choices for f and J , see for example [11].
We assume that Φ ∆t is a Runge-Kutta method with non vanishing weights for the discretization of (9). Applying a Runge-Kutta method to (9), for example the forward Euler method, we obtain and taking variations , for ξ → 0 one readily obtains the same Runge-Kutta method applied to the variational equation This means that taking variations is an operation that commutes with applying Runge-Kutta methods. This is a well known property of Runge-Kutta methods and also of the larger class of so called B-series methods, see for example [13,ch. VI.4,p. 191] for details.
In order to ensure that the first order necessary conditions for optimality from Section 2.4 are satisfied also after discretization, fixing a certain Runge-Kutta method Φ ∆t for (9), we need to discretize the adjoint equations (12) and (13) such that the overall method is a symplectic, partitioned Runge-Kutta method for the system spanned by (9), (12) and (13). This will in particular guarantee the preservation of the quadratic invariant (11), as emphasised in [26]. The general format of a partitioned Runge-Kutta method as applied to (9), (12) and (13) is for j = 0, . . . , N − 1 and boundary conditions y [0] = x, p [N ] := ∂J (y [N ] ). We will assume b i = 0, i = 1, . . . , s. 2 It is well known [13] that if the coefficients of a partitioned Runge-Kutta satisfy then the partitioned Runge-Kutta preserves invariants of the form where S is bilinear. As a consequence the invariant S(v(t), p(t)) := p(t), v(t) (11) will be preserved by such method. These partitioned Runge-Kutta methods are called symplectic. The simplest symplectic partitioned Runge-Kutta method is the symplectic Euler method, which is a combination of the explicit Euler method b 1 = 1, a 1,1 = 0, c 1 = 0 and the implicit Euler methodb 1 = 1,ã 1,1 = 1c 1 = 1. This method applied to (9), (12) and (13) gives subject to are satisfied.
Proof. See appendix A.
In (16)- (18) it is assumed that there is a parameter set u   (16) and (17) respectively. Then the gradient of the cost function J with respect to the controls is given by 1 can be computed explicitly from (24a). Remark 3.4. For the explicit Euler method, these formulas greatly simplify and the derivative of the cost function with respect to the controls can be computed as

Optimal Control motivated Neural Network Architectures
An ODE-inspired neural network architecture is uniquely defined by choosing f and specifying a time discretization of (9). Here we will focus on the common choice f (u, y) = σ(Ky + β), u = (K, β) (e.g. ResNet) and also discuss a novel option f (u, y) = α σ(Ky + β), u = (K, β, α) which we will refer to as ODENet.
Here we choose f (u, y) = σ(Ky + β), u = (K, β). For simplicity we focus on the simplest Runge-Kutta method-the explicit Euler. This corresponds to the ResNet in the machine learning literature. In this case the network relation (forward propagation) is given by and gradients with respect to the controls can be computed by first solving for the adjoint variable (backpropagation) and then computing 4.2. ODENet. In contrast to the models we discussed so far, we can also enlarge the set of controls to model varying time steps. Let u = (K, β, α) and define f (u, y) = α σ(Ky + β) .
The function α can be interpreted as varying time steps. Then the network relation (forward propagation) is given by Algorithm 1 Training ODE-inspired neural networks with gradient descent. Input: initial guess for the controls u, step size τ 1: for k = 1, . . . do 2: forward propagation: compute y via (16) 3: backpropagation: compute p via (17) 4: compute gradient g via (24) and (41) 5: update controls: u = u − τ g and gradients with respect to the controls can be computed by first solving for the adjoint variable (backpropagation) and then computing It is natural to assume the learned time steps α should lie in the set of probability distributions or discretized in the simplex This discretized constraint can easily be incorporated into the learning process by projecting the gradient descent iterates onto the constraint set S. Efficient finite-time algorithms are readily available and trivial to implement [8].

Setting, Training and Data sets.
Throughout the numerical experiments we use labels c i ∈ {0, 1} and make use of the link function σ(x) = tanh(x) and hypothesis function H(x) = 1/(1 + exp(−x)). For all experiments we use two channels (n = 2) but vary the number of layers N . In all numerical experiments we use gradient descent with backtracking, see Algorithm 2, to train the network (estimate the controls). The algorithm requires the   Algorithm 2 Training ODE-inspired neural networks with gradient descent and backtracking. Input: initial guess for controls u and parameter L, hyperparameters ρ > 1 and ρ < 1. ifφ ≤ φ + g,ũ − u + L 2 ũ − u 2 then 9: accept: u =ũ, y =ỹ, φ =φ, L = ρL 10: break inner loop 11: else reject: L = ρL derivatives with respect to the controls which we derived in the previous section. Finally, the gradients with respect to W and µ of the discrete cost function are required in order to update these parameters with gradient descent, which read as We consider 4 different data sets (donut1d, donut2d, squares, spiral) that have different topological properties, which are illustrated in Figure 1. These are samples from a random variable with prescribed probability density functions. We use 500 samples for data set donut1d and each 1,000 for the other three data sets.

Comparison of Optimal Control Inspired Methods.
We start by comparing qualitative and quantitative properties of four different methods. These are: 1) the standard neural network approach ((4) with (5) The top rows of both figures show the prediction performance of the learned parameters. The data is plotted as dots in the foreground and the learned classification in the background. A good classification has the blue dots in the dark blue areas and similarly for red. We can see that for both data sets Net classifies only a selection of the points correctly whereas the other three methods do rather well on almost all points. Note that the shape of the learned classifier is still rather different despite them being very similar in the area of the training data.
For the bottom rows of both figures we split the classification into the transformation and a linear classification. The transformation is the evolution of the ODE for ResNet, ODENet and ODENet+simplex. For Net this is the recursive formula (4). Note that the learned transformations are very different for the four different methods. Figure 4 shows the evolution of the features by the learned parameters for the data set spiral. It can be seen that all four methods result in different dynamics, Net and ODENet reduce the two dimensional point cloud to a one-dimensional string whereas ResNet and ODENet+simplex preserve their two-dimensional character. This observations seem to be characteristic as we qualitatively observed similar dynamics for other data sets and random initialisation (not shown).

Evolution of Features.
Note that the dynamics of ODENet transform the points outside the field-of-view and the decision boundary (fuzzy bright line in the background) is wider than for ResNet and ODENet+simplex.
Intuitively, a scaling of the points and a fuzzier classification is equivalent to leaving the points where they are and a sharper classification. We tested the aforementioned effect by keeping a fixed classification throughout the learning process and only learning the transformation. The results in Figures 5 and 6 show that this is indeed the case.

5.2.3.
Dependence on Randomness. We tested the dependence of our results on different random initializations. For conciseness we only highlight one result in Figure 7. Indeed, the two rows which correspond to two different random initializations show very similar topological behaviour. Figures 8 and 9 which show the evolution of function values and the classification accuracy over the course of the gradient descent iterations. The solid lines are for the training data and dashed for the test data, which is an independent draw from the same distribution and of the same size as the training data.

Quantitative Results. Quantitative results are presented in
We can see that Net does not perform as well for any of the data sets than the other three methods. Consistently, ODENet is initially the fastest but at later stages ResNet overtakes it. All three methods seem to converge to a similar function value. As the dashed line follows essentially the solid line we can observe that there is not much overfitting taking place.

Estimation of Varying Time
Steps. Figure 10 shows the (estimated) time steps for ResNet/Euler, ODENet and ODENet+simplex. While ResNet uses equidistant time discretization, ODENet and ODENet+simplex learn these as part of the training. In addition, ODENet+simplex use a simplex constraint on these values which allow the interpretation as varying time steps. It can be seen consistently for all  four data sets that ODENet chooses both negative and positive time steps and these are generally of larger magnitude than the other two methods. Moreover, these are all non-zero. In contrast, ODENet+Simplex picks a few time steps (two or three) and sets the rest to zero. Sparse time steps have the advantage that less memory is   Table 1. Four explicit Runge-Kutta methods: ResNet/Euler, Improved Euler, Kutta(3) and Kutta(4). needed to store this network and that less computation is needed for classification at test time.

5.3.
Comparing different explicit Runge-Kutta architectures. We are here showing results for 4 different explicit Runge-Kutta schemes of orders 1-4, their Butcher tableaux are given in Table 1.
The first two methods are the Euler and Improved Euler methods over orders one and two respectively. The other two are due to Kutta [19] and have convergence orders three and four. The presented results are obtained with the data sets donut1d, donut2d, spiral, and squares. In the results reported here we have taken the number of layers to be 15. In Figures 11-13 we illustrate the initial and final configurations of the data points for the learned parameters. The blue and red background colours can be thought of as test data results in the upper row of plots. For instance, any point which was originally in a red area will be classified as red with high probability. Similarly, the background colours in the bottom row of plots show the classification of points which have been transformed to a given location. In the transition between red and blue the classification will have less certainty.
In Figures 14-16, more details of the transition are shown. The leftmost and rightmost plot show the initial and final states respectively, whereas the two in the middle show the transformation in layers 5 and 10. The background colours always show the same and correspond to the final state.
Finally, in Figure 17 we show the progress of the gradient descent method over 10,000 iterations for each of the four data sets. Figure 9. Classification accuracy over the course of the gradient descent iterations for data sets donut1d, donut2d, spiral, squares. The solid line represents training and the dashed line test data.

Conclusions and outlook
In this paper we have investigated the interpretation of deep learning as an optimal control problem. In particular, we have proposed a first-optimise-then-discretize approach for the derivation of ODE-inspired deep neural networks using symplectic partitioned Runge-Kutta methods. The latter discretization guarantees that also after discretization the first-order optimality conditions for the optimal control problem are fulfilled. This is in particular interesting under the assumption that the learned ODE discretization follows some underlying continuous transformation that it approximated. Using partitioned Runge-Kutta methods we derive several new deep learning algorithms which we compare for their convergence behaviour and the transformation dynamics of the so-learned discrete ODE iterations. Interestingly, while the convergence behaviour for the solution of the optimal control problem shows differences when trained with different partitioned Runge-Kutta methods, Figure 10. Estimated time steps by ResNet/Euler, ODENet and ODENet+simplex for for data sets donut1d, donut2d, spiral, squares. ODENet+simplex consistently picks two to three time steps and set the rest to zero. the learned transformation given by the discretised ODE with optimised parameters shows similar characteristics, supporting our hypothesis of an underlying continuous optimal transformation.
An interesting direction for further investigation is to use the optimal control characterisation of deep learning for studying the stability of the problem under perturbations of Y 0 . Since the optimal control problem is equivalent to a Hamiltonian boundary value problem, we can study the stability of the first by analysing the second. One can derive conditions on f and J that ensure existence and stability of the optimal solutions with respect to perturbations on the initial data, or after increasing the number of data points. For the existence of solutions of the optimal control problem and the Pontryagin maximum principle see [3,9,1,28].   (4). The data set is donut2d and we use 15 layers.
The stability of the problem can be analysed in different ways. The first is to investigate how the parameters u(t) := (K(t), β(t)) change under change (or perturbation) of the initial data and the cost function. The equation for the momenta of the Hamiltonian boundary value problem (adjoint equation) can be used to compute sensitivities of the optimal control problem under perturbation on the initial data [26]. In particular, the answer to this is linked to the Hessian of the Hamiltonian Figure 13. Prediction (top) and transformed points (bottom) for the Runge-Kutta methods (left to right) ResNet/Euler, Improved Euler, Kutta(3) and Kutta(4). The data set is squares and we use 15 layers.
with respect to the parameters u. If H u,u is invertible, see Section 2.3, and remains invertible under such perturbations, then u = ϕ(y, p) can be solved for in terms of the state y and the co-state p.
The second is to ask how generalizable the learned parameters u are. The parameters u, i.e. K and β determine the deformation of the data in such a way that the data becomes classifiable with the Euclidean norm at final time T . It would be interesting to show that ϕ does not change much under perturbation, and neither do the deformations determined by u.
Another interesting direction for future research is the generalisation of the optimal control problem to feature an inverse scale-space ODE as a constraint, where we do not consider the time-derivative of the state variable, but of a subgradient of a corresponding convex functional with the state variable as its argument, see for example [27,5,4]. Normally these flows are discretized with an explicit or implicit Euler scheme. These discretizations can reproduce various neural network architectures [2, Section 9]. Hence, applying the existing knowledge of numerical discretization methods in a deep learning context may lead to a better and more systematic way of developing new architectures with desirable properties.
Other questions revolve around the sensitivity in the classification error. How can we estimate the error in the classification once the parameters are learned? Given u obtained solving the optimal control problem, if we change (or update) the set of features, how big is the error in the classification J(y [N ] )?   i , p [j+1] and consider the Lagrangian (42) Figure 17. Function values over the course of the gradient descent iterations for data sets donut1d, donut2d, spiral, squares.
i = 1, . . . , s, An equivalent formulation of (3) subject to (21) In (43), renaming the indexes so that i → k in the first sum and m → k and w k in the second sum, we get for j = 0, . . . , N − 1, and Because each of the variations w [j] k is arbitrary for k = 1, . . . , s and j = 0, . . . , N − 1 each of the terms must vanish and we get and finally corresponding to the discretized constraints, and where we recognize that i .

The remaining terms contain the variations v [j]
i and we have which we rearrange into i ].