Partitioned integrators for thermodynamic parameterization of neural networks

Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent, RMSProp and ADAM. These procedures tend to drive the parameters of the network toward a local minimum. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") which rely on discretized stochastic differential equations for a defined target distribution on parameter space. We show that the thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters based on natural layer structure we obtain schemes with very rapid convergence for data sets with complicated loss landscapes. We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including a multi-layer Langevin algorithm, AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods using numerical studies and compare their performance among themselves and in relation to standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms used within machine learning frameworks.

1. Introduction. Neural networks (NNs) are an important class of complex, hierarchical models which have been used in recent years for a vast range of applications.
As impactful examples we mention the exploration of chemical structure [44], medical decision making strategies for palliative care [1] and AlphaZero, which is able to master a complex challenge, e.g., learning to play Go or chess, in the span of a few days [45]. Yet there remain a number of mysteries regarding the performance of neural networks, their generality, and their ultimate reliability. An important practical challenge is that neural networks require considerable computational power for training and, in many applications, re-training. Neural networks are typically parameterized/trained using variants of (stochastic) gradient descent, where the parameters (the weights and biases of the neural network) are updated so that the training loss (the difference between the neural network output and the 'truth') is minimized. In this article, we describe new training methods suited to neural network parameterization which are applicable in a variety of settings. We focus here on classification problems and single hidden layer perceptrons; a companion paper on deep networks is in preparation. Our methods combine two basic ingredients: (i) the use of additive noise within a framework of second order stochastic dynamics, and (ii) exploitation of layer structure, which induces a partitioning of the parameters of the network. The algorithms we present build directly on recent ergodicity results obtained for Langevin and Adaptive Langevin algorithms [29, 41, 47].
An important performance measure of a trained neural network is its capacity to generalize from its training data to unseen (test) data. Although a neural network can perform extremely well on the data on which it was trained, the algorithm used for optimization may easily end up in a minimum which does not generalize well to unseen data, a phenomenon called overfitting. Several factors appear to influence the generalization capacity of a neural network, such as the number of parameters, initialization, learning rate, stopping criterion, activation functions, and the numerical method used, and no clear consensus has been reached on how these factors interact with one another. Zhang et al. (2017) [55] found that traditional complexity measures from statistical learning theory are incapable of explaining several features of the generalization behaviour of deep neural networks (DNNs). In particular, they demonstrated that neural networks have such a high capacity that they can memorize the training data and obtain zero training error on random labels (when using an architecture that gave good generalization properties when trained with real labels). Explicit regularization techniques are unable to reliably attenuate this phenomenon [55]. Regularization, which adds a parameter norm penalty term to the loss function of neural networks, is a standard approach to prevent overfitting, but does not necessarily affect the generalization error.
So how do we find parameterizations that generalize well? Loss landscapes of deep neural networks are known to possess many low-loss minima [4], but not all of these minima generalize equally well, and different optimizers may find different solutions [36, 52, 19]. The loss landscapes of neural networks are difficult to interpret due to their high dimensionality and non-convexity. One would expect optimizers to be likely to get stuck in isolated local minima, but this was disputed by Goodfellow et al. (2015) [13], who show that a large variety of neural networks never encounter any obstacles on their optimization path, i.e., the loss typically decreases monotonically from the initial to the final optimization step. This helps to explain the success of methods such as stochastic gradient descent (SGD) in optimizing neural networks, despite the non-convexity of the objective functions. However, in this paper we argue that the results obtained by Goodfellow et al. (2015) [13] do not hold for some common types of problems, for which SGD, as well as improved optimizers such as Adam [24], can be shown to fail. This failure is likely a consequence of the more complex structure of the loss landscape of these problems, and it motivates the development of more sophisticated schemes for enhanced exploration of the low-loss states in these settings.
1.1. Bayesian perspective on neural network training. In this article, we focus on the training (parameterization) process for neural networks using ideas from statistical mechanics. Neural networks approximate a function y = f(x), f : R^m → R^n, by an abstract family of maps having a simple homogeneous form.
Here we consider a single hidden layer perceptron network with the structure

z = ϕ(w^(1) x + b^(1)),    ŷ = ϕ₀(w^(2) z + b^(2)),

where x ∈ R^m is an input data vector, w^(1) ∈ R^{d×m} and w^(2) ∈ R^{n×d} are matrices that contain the weights of the various layers, b^(1) ∈ R^d and b^(2) ∈ R^n are the biases, z ∈ R^d is the network's output after passing the input data through the first layer, and ŷ ∈ R^n is the neural network's approximation of the true labels y (for a classification problem). The function ϕ is taken to be a ReLU activation function [20, 12] in our experiments (for other examples of activation functions we refer to [53]). The function ϕ₀ is taken to be either a sigmoid, for a binary classification problem, or a softmax, for the MNIST data set. The equations define a map Φ : R^m × R^q → R^n, where q = md + d + (d + 1)n is the dimension of the parameter space; that is, we have ŷ = Φ(x, θ), where θ contains all the parameters of the neural network. The data D fed into the neural network consists of pairs of input data x and their labels y. The loss function is then determined by the difference between the neural network output ŷ and the true labels y. In our experiments we used a binary cross entropy loss function (for binary classification) or cross entropy loss function (for MNIST), but many different loss functions are available [34].
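As a concrete illustration, the forward map above can be sketched in a few lines of NumPy; the dimensions m = 2, d = 3, n = 1 are arbitrary toy choices, and the sigmoid output corresponds to the binary classification case.

```python
import numpy as np

def shlp_forward(x, w1, b1, w2, b2):
    """Single hidden layer perceptron:  ŷ = φ0(w2 φ(w1 x + b1) + b2)."""
    z = np.maximum(w1 @ x + b1, 0.0)        # hidden layer with ReLU activation φ
    logits = w2 @ z + b2                    # output layer
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid φ0 for binary classification

# Toy dimensions: m = 2 inputs, d = 3 hidden units, n = 1 output,
# so q = m·d + d + (d + 1)·n = 10 trainable parameters in total.
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
w2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
y_hat = shlp_forward(np.array([0.5, -1.0]), w1, b1, w2, b2)
```

The vector of all ten entries of w1, b1, w2, b2 plays the role of θ in what follows.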
In this article, we take the Bayesian perspective that the parameters θ are defined by the data D only in the sense of a probability distribution given by Bayes' formula, ρ(θ|D) ∝ ρ(D|θ)ρ₀(θ), where ρ(D|θ) is the likelihood, which for a cross entropy loss function takes the form ρ(D|θ) ∝ exp(−L(θ|D)), and ρ₀(θ) encodes prior knowledge of θ. The exploration of values of θ that are consistent with Bayes' rule then becomes the outstanding challenge. When ρ(θ|D) is unimodal and convex it is natural to choose θ as the mode of the target distribution by maximizing the posterior probability density, a technique referred to as "MAP," for "maximum a posteriori probability." In practice this does not hold for neural networks, and it then becomes a challenge to identify all relevant possible parameter values, and to compare different parameter choices in terms of their relative probabilistic weight. This task is referred to as sampling, and thus the Bayesian parameterization problem naturally reduces to a sampling problem for the parameters of the model. While the idea of Bayesian modelling is commonplace in all areas where statistics is used, the Bayesian perspective is usually only viewed as the starting point for optimization schemes in the setting of high dimensional neural networks, due to the vast amounts of data and parameters involved [35]. We argue here that the sampling approach provides parameterization candidates with as great or greater efficiency than standard optimization schemes.
A single parameter vector is typically not a meaningful way to characterize a model, since it fails to capture the fundamental statistical nature of the relationship between data and model. Assuming that θ is a random variable partially constrained by the knowledge of the training set, the output ŷ of the neural network is also a random variable, with its own probability distribution directly related to that of θ. We can compute the mean of the output ŷ from the parameter distribution:

ȳ = ∫ Φ(x, θ) ρ(θ|D) dθ,

where ρ is the normalized Bayesian density for θ. We sample parameter space by generating a sequence of discrete values of θ defined by some Markov chain θ⁰ → θ¹ → . . .. Then we simply approximate ȳ by

ȳ ≈ (1/K) Σ_{k=1}^{K} Φ(x, θ^k).

Depending on the application, it is often enough to perform a draw of a singleton from the distribution of parameters thus generated. Note that the Bayesian approach gives access to the mean ȳ, as well as other statistics (such as variances) by a similar procedure. It is also possible to rely on the mode (or several modes, in complicated systems) as proposed parameter values and to examine the sensitivities of those parameters using averaging methods.
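In code, approximating ȳ from a chain of parameter draws is simply an average over forward passes. In the sketch below, Φ is a hypothetical one-parameter model chosen only for illustration, and the listed θ values stand in for Markov chain samples.

```python
import numpy as np

def posterior_mean_output(Phi, x, theta_samples):
    """Approximate  ȳ ≈ (1/K) Σ_k Φ(x, θ^k)  from K Markov-chain draws of θ.
    The same samples also yield higher moments, e.g. the output variance."""
    outputs = np.array([Phi(x, theta) for theta in theta_samples])
    return outputs.mean(axis=0), outputs.var(axis=0)

# Hypothetical one-parameter "network" Φ(x, θ) = tanh(θ x), for illustration only.
Phi = lambda x, theta: np.tanh(theta * x)
theta_samples = [0.9, 1.0, 1.1]      # stand-in for chain draws θ^1, θ^2, θ^3
ybar, yvar = posterior_mean_output(Phi, 2.0, theta_samples)
```

In practice the samples would come from one of the discretized SDE schemes described later, after discarding a burn-in segment.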
Several issues are raised by the use of MCMC methods, such as equilibration of the Markov process (the "burn-in" phase, in the language of statistics), the problem of high correlation among the samples taken along the sampling path, and the actual computational procedure by which such samples can be generated efficiently. We will not address all of these issues here, but we will show that taking a sampling perspective can cast new light on some challenging problems in machine learning.
There is a well known link between posterior sampling and MAP estimation. Introduce the negative log posterior l(θ) = − ln ρ(θ|D), and define

ρ_τ(θ) ∝ exp(−l(θ)/τ).

For τ = 1 we have the posterior density. For τ → 0 we obtain a sequence of distributions which, although globally supported, have their mass confined progressively closer to the mode of the distribution. Thus we can think of MAP as an extreme form of sampling in which the sampled distribution is more and more confined to the vicinity of the mode or modes. In this setting, τ becomes a parameter of an embedded family of models which may be used to enhance the optimization process. An example is the process known as annealing, where τ is gradually driven from higher to lower values [25]. The parameter τ plays precisely the same role as temperature in statistical physics, hence our use of the term thermodynamic parameterization to describe methods that rely on this embedding (and the sampling of the associated family of probability distributions) to enhance the parameterization procedure.
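The effect of the temperature embedding is easy to check numerically. The sketch below evaluates ρ_τ(θ) ∝ exp(−l(θ)/τ) on a grid for a tilted double-well negative log posterior (a hypothetical l chosen purely for illustration) and confirms that lowering τ concentrates the mass near the global minimizer.

```python
import numpy as np

def tempered_density(l, tau):
    """Normalized grid approximation of ρ_τ(θ) ∝ exp(−l(θ)/τ)."""
    w = np.exp(-(l - l.min()) / tau)   # shift by min(l) for numerical stability
    return w / w.sum()

theta = np.linspace(-2.0, 2.0, 2001)
l = (theta**2 - 1.0)**2 + 0.1 * theta      # double well; global minimum near θ = −1

rho_hot = tempered_density(l, tau=1.0)     # the posterior itself (τ = 1)
rho_cold = tempered_density(l, tau=0.01)   # mass confined near the mode (τ → 0)

mass_near_mode = lambda rho: rho[np.abs(theta + 1.0) < 0.25].sum()
```

For this l, almost all of the mass of rho_cold sits near θ ≈ −1, while rho_hot retains substantial mass in the other well; annealing corresponds to sweeping τ from the first regime to the second.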
We note that the full exploration of the parameter space, taken as a region of Euclidean space, would be implausible in high dimensions. Neural networks are sometimes used with millions or billions of degrees of freedom, and there is no conceivable way to fully explore such a space. On the other hand, only a very small range of parameter values is likely to be of interest (the ones that have relatively large statistical weight with respect to the probability distribution). Moreover, there is often much to be gained by exploring parameters in the vicinity of a local maximum, i.e. by short sampling paths. It is important to recognize that MAP estimation, as normally practiced, is local, not global, optimization. Molecular dynamics [28] provides an obvious illustration of the potential value of the sampling paradigm in very high dimensions.
1.2. The parameterization process using stochastic gradients. In this subsection, we outline the standard training procedure based on stochastic gradient descent. In subsection 1.3 we shall discuss the alternative stochastic gradient Langevin dynamics method as an illustration of a sampling method.
The starting point for most training schemes is a system of ordinary differential equations of the form dθ = G(θ)dt, where the function G is the negative gradient of the loss function L(θ|D) defined in terms of the entire training data set D. Such gradient systems have the feature that, along their solutions θ(t), we have

(d/dt) L(θ(t)) = ∇L(θ(t)) · θ'(t) = −‖∇L(θ(t))‖² ≤ 0,

implying that the loss decreases monotonically along solutions. Since local minima are stationary points, one hopes that this dynamics steadily drives the parameters to such local minima. The most common numerical method used for solving the system is the explicit Euler method,

θ_{n+1} = θ_n + h G(θ_n),

for a choice of discretization stepsize h > 0. When the gradient is approximated by evaluating it on a randomly sub-sampled partial data set, we introduce gradient noise into the dynamics. This noise can be approximately modelled by replacing G in each evaluation by G̃(θ) = G(θ) + Σ(θ)R, where ΣΣ^T is the noise covariance matrix and R a standard normal random vector with iid components. We can thus re-interpret the training process as being

θ_{n+1} = θ_n + h G(θ_n) + h Σ(θ_n) R_n,

which we recognize as the Euler-Maruyama discretization of the Itô SDE [10]

dθ = G(θ) dt + √h Σ(θ) dW_t   (2)

using a stepsize h. It is an odd feature of the process that the discretization stepsize appears in the right hand side of the SDE itself [30]. The ratio of stepsize to batch size (the size of the sub-sampled data set) was shown by Jastrzȩbski et al. (2017) [21] to directly influence the type of parameterizations found by this method.
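The noisy training step can be written down directly. The sketch below uses a scalar noise amplitude σ as a crude stand-in for Σ(θ), together with a quadratic toy loss, purely for illustration; with σ = 0 it reduces to plain gradient descent.

```python
import numpy as np

def sgd_step(theta, G, h, sigma, rng):
    """One step θ_{n+1} = θ_n + h G(θ_n) + h σ R_n, i.e. the Euler-Maruyama
    step for dθ = G(θ)dt + √h σ dW with stepsize h (G is the negative
    gradient of the loss; h σ R_n models subsampling noise)."""
    return theta + h * G(theta) + h * sigma * rng.standard_normal(theta.shape)

# Toy loss L(θ) = |θ|²/2, so G(θ) = −θ; the iteration contracts toward the minimum.
G = lambda th: -th
theta = np.ones(3)
rng = np.random.default_rng(0)
for _ in range(100):
    theta = sgd_step(theta, G, h=0.1, sigma=0.0, rng=rng)
```

With σ = 0 the iterate is exactly (1 − h)^n θ₀, so after 100 steps of size h = 0.1 the parameters are very close to the minimizer at the origin.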
The system (2) is driven by multiplicative noise.Since the gradient noise defined by Σ(θ) is complicated, this system of SDEs has an unknown invariant distribution which will depend on the subsampling.However, if h → 0 in (2) it is clear that we arrive eventually at a local minimum of the loss.One assumes that for a small value of the stepsize the consequence is that we arrive near such a local minimum, or, to be precise, due to the inherent degeneracy of neural networks, near to a manifold of local minima of the loss.
One might worry that the dynamics could be drawn frequently toward saddle points, where G(θ) = 0. Although such points are unstable (under continual perturbation the dynamics bypasses the saddles, and local minima are indeed eventually located), there are other downsides to relying on gradient flow as the foundation for training algorithms. Namely, we can only ever count on gradient dynamics as a local minimization procedure. It has, a priori, no mechanism for global exploration. Introducing ad hoc mechanisms to increase exploration is prone to failure since, in high dimensions, there is no natural way to grid the parameter space.
There exist an increasing number of methods, in the same class as SGD, which aim to accelerate the scheme described above. These include SGD with momentum (see subsection 2.4), RMSProp [49], AdaGrad [8] and Adam [24]. Although the efficacy of these methods in large scale machine learning is an active area of research [52], we have found that they can sometimes improve training if carefully adjusted by choice of parameters (most importantly the stepsize). In most of the cases considered in this paper, Adam gave substantially better results than SGD.
1.3. SDE-based schemes in machine learning. Stochastic gradient Langevin dynamics (SGLD) [50], the unadjusted Langevin algorithm (ULA) [40, 9] and the stochastic gradient Nosé-Hoover thermostat (SGNHT) [22, 6] are examples of existing thermodynamic sampling methods. In SGLD one introduces an additional additive noise into (2), resulting in

dθ = G(θ) dt + √h Σ(θ) dW_t + √(2τ) dW̃_t.

The additive noise is usually taken to have constant variance. SGLD is typically discretized using Euler-Maruyama, resulting in

θ_{n+1} = θ_n + h G̃(θ_n) + √(2τh) R_n.

At this stage, we see that for small h, the √h additive-noise term will strongly dominate the O(h) gradient noise, and conclude that if we replace the constant stepsize h by a decaying sequence of stepsizes h_n → 0, we would, in the long term, expect to generate states from the stationary distribution; hence the claim that SGLD is a sampling method for a known distribution. The mathematical analysis of this method relies on the framework known as "stochastic approximation" [26]. The caveat of course is that such a rigorous procedure requires the use of small stepsizes, which would be expected to slow the sampling process. In practice a small bias is accepted in exchange for being able to more efficiently sample the target distribution, although Brosse et al. (2018) [3] argue that the high variance of stochastic gradients can limit the usefulness of SGLD in practice.
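A minimal SGLD sketch follows, using a clean gradient on a scalar quadratic loss so that the target exp(−L(θ)/τ) is Gaussian with variance τ; the long-run sample variance should then be close to τ, up to O(h) discretization bias.

```python
import numpy as np

def sgld_step(theta, G, h, tau, rng):
    """θ_{n+1} = θ_n + h G̃(θ_n) + sqrt(2 τ h) R_n.  For small h the sqrt(h)
    injected noise dominates the O(h) gradient noise hidden inside G̃."""
    return theta + h * G(theta) + np.sqrt(2.0 * tau * h) * rng.standard_normal()

G = lambda th: -th          # negative gradient of the toy loss L(θ) = θ²/2
tau, h = 1.0, 0.01
rng = np.random.default_rng(0)
theta, samples = 0.0, []
for n in range(200_000):
    theta = sgld_step(theta, G, h, tau, rng)
    if n > 10_000:          # discard a burn-in segment
        samples.append(theta)
var = np.var(samples)
```

For this linear toy problem the chain is an AR(1) process with stationary variance τ/(1 − h/2), so the measured variance sits close to τ = 1 for small h.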
In this article we propose to use, as in SGLD, additive noise (which has an adjustable but fixed strength) to stabilize the invariant measure of the stochastic dynamics.In contrast to SGLD however, we rely on underdamped Langevin dynamics and apply state-of-the-art discretization methods [28,47], which introduce additive noise within a framework of second order stochastic dynamics.Additionally, we partition our algorithm based on the natural layer structure of the neural network.
For properties of the partitioned algorithms we draw on three recent works: (i) hypoelliptic properties of Langevin dynamics numerical methods [29], (ii) hypoelliptic properties of Langevin dynamics with configuration-dependent noise [41] and (iii) very recent work on weighted-L² hypocoercivity of Adaptive Langevin dynamics [47]. To connect our methods with these works, recall that our methods use additive noise and that in practice there will also be a second noise term with an unknown covariance arising from the gradient approximation. Such an approach is close to the systems treated in [41], where the SDEs take the form of a Langevin system in which the friction matrix Γ is allowed to vary with position q (which in the results above corresponds to our parameters θ):

dq = p dt,    dp = G(q) dt − Γ(q) p dt + Σ(q) dW.

Nondegeneracy of Σ(q) is required for the results of [41] to hold, but we will obtain that by driving the system by additive noise of defined strength (in each momentum equation). In the case of the method AdLaLa (described in subsection 3.3), which makes use of Adaptive Langevin (AdL) dynamics, we have hypocoercivity results for AdL [47] which can be used to justify the method. These methods are based on position-independent noise. We conjecture that these hypocoercivity results can be extended to systems with position-dependent noise.
1.4. Improving stability of neural network parameterization using partitioned stochastic methods. In this paper we make use of the layer structure of neural networks to obtain partitioned algorithms that use a different optimizer for different parts of the network. We show for certain datasets that these schemes can significantly accelerate training. There have been a number of attempts in recent years to design better training strategies by relying on the detailed structure of neural networks. For example, the method AdaDelta [54] attempted to use an adaptive procedure to vary the learning rate (integration stepsize) according to dimension. Singh et al. (2015) [46] looked at using different stepsizes to treat the weights and biases in different layers. Although the method developed showed improvements compared to using the same stepsize, the gains were small. An effort which may be potentially more relevant to our article is the work of Lan et al. (2019) [27], which found that freezing the last layer (i.e., fixing the weights and biases in the last layer) results in significant performance gain. We are not aware of a prior effort to use differential thermostatting among layers in the design of training algorithms. In several of our experiments we found it advantageous to use low temperatures (even zero temperature) in the output layer but to maintain the hidden layer weights and biases at slightly elevated values. This means that those inner parameters can rapidly explore a wide range of low-loss states. We conjecture that it is this fluidity in the hidden layer which gives the LOL and AdLaLa methods described here their improved convergence speed.
It is well known that local minima can be very sensitive to small changes in the choice of hyperparameters. This sensitivity has implications for the reliability and stability of training algorithms. Standard methods to improve the stability of neural networks include L¹ and L² regularization à la Tikhonov [48, 51, 16]. In our experience, these methods cannot be relied upon to improve the test accuracy of a classifier, where the term "test accuracy" indicates how many of the test data points (unseen during the training process) are correctly classified by the trained neural network.
We suggest that stochastic differential equations impose a different form of regularization, since the SDEs incorporate additive noise. A notable ramification is that thermodynamic parameterizations appear to give rise to classifiers whose level sets are relatively smooth compared to those produced by alternative methods. Thermodynamic parameterization thus effectively controls the distribution of weights, or more precisely the distribution of the conjugate momenta associated to the weights, due to the statistical mechanical property known as equipartition of energy. By drawing parameter states from a sufficiently rich distribution of nearby candidate states, we show that the thermodynamic schemes produce smoother classifiers, improve generalization and reduce overfitting compared to traditional optimizers. In our studies of spiral and other data sets herein, we did not make use of any regularization method; omitting it did not appear to affect the test accuracies we obtained.
A benefit of using the thermodynamic parameterization approach as outlined here is to reduce the dependence of the training result on the initial conditions or the details of the mechanism of training.Unlike in conventional stochastic gradient descent and other schemes, the methods we advocate are formally ergodic, meaning that they have a unique stationary distribution and (almost all) trajectories converge to sampling paths for the same target distribution.This provides another way in which these schemes can improve robustness.Even if, in practice, we are unable to see the entire distribution due to computational limitations, it is desirable that the process can in principle be improved by continued exploration.
The per-step cost of our methods is (unless otherwise noted) roughly similar to that of other training methods such as stochastic gradient descent and ADAM, assuming the major cost of a timestep is dominated by the computation of the approximate gradient. We examine the relative performance of the different methods in a detailed series of numerical experiments. We also examine, again in numerical experiments, the key question of the variance of the results obtained by different methods, which points to the reliability and robustness of the schemes.

2. Langevin and Adaptive Langevin schemes.
In what follows, let L(θ) represent the overall loss defined in relation to the training data set D (where we have suppressed the explicit dependence of the loss L(θ|D) on D for simplicity of notation).We suppose the loss to be piecewise smooth, Lipschitz continuous, for example as obtained using mean square error or cross entropy.We may augment the model by a (mild) quadratic regularization to ensure confinement of the parameters.
All algorithms considered here are based on gradients. We let G(θ) := −∇_θ L, i.e. the (full) negative gradient of the loss, and denote by G̃ the truncated negative gradient obtained by selecting a randomized (uniformly sampled) subset D̃ ⊂ D of fixed size at each timestep. In all algorithms, the stepsize (learning rate) is denoted by h. The temperature parameter used in the thermodynamic algorithms is denoted by τ ≥ 0. R_n typically represents a vector of i.i.d. standard normal random numbers drawn at timestep n.
In this paper we will primarily be concerned with the use of degenerate stochastic differential equations (SDEs) as the mechanism of parameterization. We may write these in the Itô formalism [10] as

dX = f(X) dt + Σ(X) dW_t.

The degeneracy lies in the fact that Σ is not necessarily of full rank. This family of SDEs includes the underdamped and overdamped forms of Langevin (Brownian) dynamics. It also includes various thermostat methods such as Adaptive Langevin dynamics, which is based on a stochastic generalization of the (deterministic) Nosé-Hoover thermostat.
2.1. Langevin dynamics. Consider the Langevin dynamics [28] system of SDEs:

dθ = p dt,   (6)
dp = G(θ) dt − Γ p dt + Σ dW_t,   (7)

where θ and p are the position and momentum vectors respectively, W_t a standard N-dimensional Wiener process, and Γ and Σ are symmetric positive definite matrices, which we shall assume to be position-independent in the remainder of this paper. In the special case where ΣΣ^T = 2τΓ for scalar τ > 0 the dynamics obeys a fluctuation-dissipation theorem, and under some mild assumptions is provably ergodic (see Section 5). This ensures that solutions of the dynamics sample the distribution ρ_τ(θ, p), where

ρ_τ(θ, p) ∝ exp(−(L(θ) + |p|²/2)/τ).

As this stationary distribution doesn't depend on the friction term Γ, a common simplification is to simply choose Γ = γI and Σ = √(2γτ) I in (7). In what follows, we will make use of this convention.

2.2. Langevin dynamics splitting methods.
A popular way of building discretization schemes for Langevin dynamics is via the use of splitting methods [28, 29]. Such schemes are developed by writing the vector field as an additive decomposition (a "splitting") into separate parts and solving for each piece in sequence. In this article we shall use a Langevin splitting into pieces denoted A, B and O:

A: dθ = p dt;    B: dp = G(θ) dt;    O: dp = −γ p dt + √(2γτ) dW_t,   (8)

each of which, when taken individually, can be solved 'exactly' in its evolution of distribution [28]. Individual update maps of the splitting pieces are then given by

A_h: θ ← θ + h p,
B_h: p ← p + h G(θ),
O_h: p ← e^{−γh} p + √(τ(1 − e^{−2γh})) R,   (9)

where R is a vector of i.i.d. standard normal random numbers. The last expression in Eq. (9) is the exact (weak) solution of the Ornstein-Uhlenbeck equation defined by the O part. We can code schemes by changing the order in which we apply the updates, with repeated letters indicating substeps (i.e. two 'A's indicate that each should be a half step of size h/2). For example, using the update rules in Eq. (9), the BAOAB scheme is given by

p ← p + (h/2) G(θ),
θ ← θ + (h/2) p,
p ← e^{−γh} p + √(τ(1 − e^{−2γh})) R,
θ ← θ + (h/2) p,
p ← p + (h/2) G(θ).

In the case of Langevin dynamics applied to systems with gradient noise, we can understand a little the interplay of stepsize and friction by reference to a simplified model in which the gradient noise is assumed to be described by a stationary Gaussian process. Taking for simplicity scalar friction γ and a common scalar noise amplitude σ_G, we replace Eq. (6)-(7) by

dθ = p dt,   (10)
dp = G(θ) dt + √h σ_G dW_t − γ p dt + √(2γτ) dW̃_t,   (11)

where the appearance of h is the consequence of the same argument presented in the introduction. In the absence of gradient noise this system samples the canonical distribution for temperature τ. We next combine the noise terms to obtain

dp = G(θ) dt − γ p dt + √(2γτ + h σ_G²) dW_t.

This corresponds to Langevin dynamics at the effective temperature

τ_eff = τ + h σ_G² / (2γ).

This relation suggests taking the stepsize in proportion to γ in order to maintain an approximately constant effective temperature as either parameter is varied.
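Under the convention Γ = γI, Σ = √(2γτ)I, one BAOAB step can be coded directly from the update maps; the quadratic toy gradient in the usage check below is illustrative only, chosen so that the momenta should equilibrate with variance ≈ τ per component (equipartition).

```python
import numpy as np

def baoab_step(theta, p, G, h, gamma, tau, rng):
    """One BAOAB step: B and A half steps around an exact O (Ornstein-Uhlenbeck)
    update of the momentum, followed by A and B half steps."""
    p = p + 0.5 * h * G(theta)                       # B, half step
    theta = theta + 0.5 * h * p                      # A, half step
    c = np.exp(-gamma * h)
    p = c * p + np.sqrt(tau * (1.0 - c * c)) * rng.standard_normal(p.shape)  # O
    theta = theta + 0.5 * h * p                      # A, half step
    p = p + 0.5 * h * G(theta)                       # B, half step
    return theta, p

# Usage check on the toy loss L(θ) = |θ|²/2: kinetic temperature should ≈ τ.
G = lambda th: -th
rng = np.random.default_rng(0)
theta, p = np.ones(5), np.zeros(5)
kin = []
for n in range(50_000):
    theta, p = baoab_step(theta, p, G, h=0.1, gamma=1.0, tau=1.0, rng=rng)
    if n > 5_000:
        kin.append(np.mean(p * p))
```

Other letter orderings (OBA, ABOBA, ...) are obtained by simply permuting the same three update maps.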
2.3. Role of temperature. To make clear the role of temperature in the parameterization of neural networks, we present in Fig. 1 four classifiers for planar trigonometric data (see Sec. 4 for a full description of this data set). Each classifier was obtained using Langevin dynamics and a single hidden-layer perceptron (SHLP) for a fixed amount of work, but was parameterized with a different temperature. Both the test accuracy and the qualitative features of the classifier change with the temperature parameter, with results significantly improving as temperature increases. In further experiments we observed that further increases of the τ parameter can negatively affect the results, suggesting a 'Goldilocks' temperature region of optimal efficiency. A hypothetical model for the cause of the performance gain due to elevated temperature might be found by considering molecular diffusion on a rough energy landscape [56, 39]. In a corrugated energy surface and at zero temperature the system will likely get stuck in local minima, lacking the required energy to overcome barriers blocking movement between states. Increasing the temperature allows weak interaction with a heat bath, randomly introducing energetic fluctuations into the system that can move it over barriers and away from local minima. The size of the fluctuations can be related to the temperature parameter: too small and it will take a long time to cross barriers, whereas too large and the system will not be drawn towards the global minimum.

2.4. Relation between Langevin sampling methods and certain schemes in the literature. If we assume that a fluctuation-dissipation relationship holds and use Γ = γI then, by applying the corresponding mappings, the OBA Langevin scheme can be written as Algorithm 1.

Algorithm 1 The OBA splitting scheme
1: procedure OBA(θ, p, γ, τ, T, h)
2:   for t ← 1 to T do
3:     R ← vector of i.i.d. standard normal random numbers
4:     p ← e^{−γh} p + √(τ(1 − e^{−2γh})) R    ▷ O
5:     p ← p + h G̃(θ)    ▷ B
6:     θ ← θ + h p    ▷ A
7:   return θ, p

We use G̃ to denote the truncated negative gradient obtained by selecting a randomized finite subset of the data at each time step. We can define a family of schemes through specific choices of friction γ and temperature τ. Some of these correspond to existing schemes in the literature. For example, setting τ = 0 and using finite friction we arrive at a reparameterization of the SGD with momentum scheme, with two variants given in Algorithm 2. Choosing the damping parameter µ = exp(−γh) and learning rate δt = h² in type I we recover the OBA scheme with τ = 0. Similarly in type II we reparameterize µ = (1 + h exp(γh))^{−1} and δt = h + exp(−γh). It is clear that we recover the traditional SGD scheme if γ → ∞, or equivalently if µ → 0.
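The τ = 0 correspondence can be checked numerically. The sketch below runs the OBA update with the noise switched off alongside a momentum update v ← µv + G(θ), θ ← θ + δt v; one consistent identification (obtained by setting v = p/h) is µ = e^{−γh}, δt = h², and the two trajectories then coincide to machine precision. The quadratic gradient is illustrative only.

```python
import numpy as np

def oba_step_tau0(theta, p, G, h, gamma):
    p = np.exp(-gamma * h) * p + h * G(theta)   # O then B; τ = 0 removes the noise
    theta = theta + h * p                       # A
    return theta, p

def momentum_step(theta, v, G, mu, dt):
    v = mu * v + G(theta)                       # damped momentum accumulation
    theta = theta + dt * v
    return theta, v

G = lambda th: -th                  # negative gradient of the toy loss L(θ) = θ²/2
h, gamma = 0.1, 2.0
mu, dt = np.exp(-gamma * h), h * h  # the identification, with v = p/h
th_oba, p = 1.0, 0.0
th_mom, v = 1.0, 0.0
for _ in range(50):
    th_oba, p = oba_step_tau0(th_oba, p, G, h, gamma)
    th_mom, v = momentum_step(th_mom, v, G, mu, dt)
```

Sending γ → ∞ drives µ → 0, which removes the momentum memory and recovers plain SGD.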

Algorithm 2 The OBA splitting scheme with τ = 0
1: procedure OBA_TAU_ZERO(θ, p, γ, T, h)
2:   for t ← 1 to T do
3:     p ← e^{−γh} p + h G̃(θ)    ▷ O and B; the noise term vanishes at τ = 0
4:     θ ← θ + h p    ▷ A
5:   return θ, p

Similarly we may consider the limiting case of infinite friction and positive τ in the OBA scheme, where the momenta are redrawn from their distribution at every step. The resulting scheme (see Algorithm 3) matches the SGLD scheme under a reparameterization between temperature and noise strength, ε² = τ, and learning rate δt = h². We may extend SGLD to include momentum by instead using a finite friction parameter γ in Algorithm 1.
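The infinite-friction correspondence can likewise be verified step by step by feeding both updates the same Gaussian draw: redrawing p ∼ N(0, τ), kicking, and drifting gives θ ← θ + h²G̃(θ) + h√τ R, which is an SGLD step with δt = h² and ε = √τ. The quadratic gradient is again only a toy choice.

```python
import numpy as np

def oba_inf_friction_step(theta, G, h, tau, R):
    p = np.sqrt(tau) * R          # O with γ → ∞: momentum fully redrawn from N(0, τ)
    p = p + h * G(theta)          # B
    return theta + h * p          # A

def sgld_step(theta, G, dt, eps, R):
    return theta + dt * G(theta) + eps * np.sqrt(dt) * R

G = lambda th: -th
h, tau = 0.1, 0.5
dt, eps = h * h, np.sqrt(tau)     # reparameterization δt = h², ε² = τ
rng = np.random.default_rng(1)
th1 = th2 = 1.0
for _ in range(100):
    R = rng.standard_normal()     # shared noise so the trajectories can be compared
    th1 = oba_inf_friction_step(th1, G, h, tau, R)
    th2 = sgld_step(th2, G, dt, eps, R)
```

Replacing the full momentum redraw with the finite-friction O map interpolates between this limit and the momentum-augmented scheme of Algorithm 1.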
Thus, with a specific interpretation of the coefficients in SGD-with-momentum and SGLD we can obtain certain Langevin integrators. All of the schemes which are of standard type are of low order of accuracy and are relatively crude in their construction; in molecular dynamics it has been shown that schemes like BAOAB substantially improve on sampling accuracy. We thus look to the family of splitting-based methods (and further generalizations as described below) to provide enhanced training strategies.

Algorithm 3 The OBA splitting scheme with infinite friction
1: procedure OBA_INFINITE_FRICTION(θ, τ, T, h)
2:   for t ← 1 to T do
3:     R ← vector of i.i.d. Gaussian random numbers
4:     θ ← θ + h² G̃(θ) + h √τ R
5:   return θ
6: procedure SGLD(θ, ε, T, δt)
7:   for t ← 1 to T do
8:     R ← vector of i.i.d. Gaussian random numbers
9:     θ ← θ + δt G̃(θ) + ε √δt R
10:  return θ
2.5.Adaptive Langevin and the Nosé-Hoover thermostat.Adaptive Langevin dynamics (AdL) is a method in which the friction parameter of Langevin dynamics is automatically determined by an isokinetic control law.The method derives from Nosé-Hoover dynamics developed by S. Nosé and W. Hoover in the early 1980s.
Their proposal was to use a deterministic system to sample from the canonical ensemble [37,17].The Adaptive Langevin method, which incorporates additive noise, was first elucidated in [22] and has since been employed in a variety of multiscale modelling and noisy gradient settings [6].Analyses of this method can be found in [30,15,47].
The equations take the form of a degenerate SDE system:

dθ = p dt,    (12)
dp = G(θ) dt − εξ p dt + σ dW,    (13)
dξ = ε [pᵀp − Nτ] dt.    (14)

The hyperparameters are the coupling coefficient ε, the number of parameters N, the temperature τ, and the driving noise amplitude σ.
If we assume, as in subsection 2.2, that a Gaussian stationary process with amplitude σ_G defines the gradient noise, then we may rewrite (12)-(14) as a system with a clean gradient of the form

dθ = p dt,
dp = −∇L(θ) dt − εξ p dt + (σ² + σ_G²)^{1/2} dW,
dξ = ε [pᵀp − Nτ] dt.

According to [22], this system will sample the canonical distribution at temperature τ. This implies that εE[ξ] ≡ γ_eff, the effective friction, balances the total noise variance σ² + σ_G² at stationarity. In other words, higher additive noise σ leads directly to larger effective friction; larger gradient noise likewise effectively increases the friction.
Various discretization schemes are obtained by breaking the AdL system into pieces (as in the discretization of Langevin dynamics) and solving the parts separately. The maps A and B mentioned below are identical to those described in the context of Langevin dynamics, although formally they need to be supplemented by an identity mapping of ξ.
The simplest approach is to define the additional maps C, D, E by

C: dp = −εξ p dt,    D: dp = σ_A dW,    E: dξ = ε [pᵀp − Nτ] dt,

each of which can be solved exactly with the remaining variables held fixed. An obvious method is then defined by the composition BACDEDCAB. As an alternative, one may note that the components in C and D may be combined, resulting in an Ornstein-Uhlenbeck equation which can be solved analytically (in the weak sense). That is, we let F_h represent the weak solution of the equation dp = −εξ p dt + σ_A dW, with ξ held constant, and substitute this F step in place of C and D. Care must be taken to treat small values of ξ within this scheme. We tested both methods but did not observe notable differences in performance, and so proceeded with the first method for all our experiments.
3. Partitioned discretization algorithms for neural networks. In layered or hierarchical models, e.g. deep neural networks, we have a natural partitioning of the parameter vector according to its role in the hierarchy. It may be useful to treat the parameters at different levels of the hierarchy differently in the parameterization process. In particular, it is possible that, whether by design or due to some feature of the network, the characteristics of the gradient noise introduced at different layer depths may differ, and it is then natural to design a method that treats the components independently. Lan et al. (2019) [27] observed that freezing the last layer of a neural network (while using SGD with momentum for the flexible components) can enhance the performance of training algorithms. We draw on this idea here as motivation for developing a family of partitioned algorithms that can be used to train neural networks.
In this article we focus on single hidden layer perceptrons, for which we shall use a two-part partitioning. Let θ = (θ⁽¹⁾, θ⁽²⁾) be a partitioning of the full parameter vector, and assume a similar partitioning of the momenta (p⁽¹⁾, p⁽²⁾) as well as of the Wiener process W(t). The partitioning can be defined in various ways. For example, we could group together the weights and biases at each layer, setting θ⁽¹⁾ = (W₁, b₁) for the hidden layer and θ⁽²⁾ = (W₂, b₂) for the output layer. This is the approach we have taken in our experiments. In extending this framework to deep neural networks one could also include several layers (or all hidden layers, say) as one part of the partitioning. We next describe a number of different families of partitioned integrators which could be used to take advantage of the layer structure.
3.1. Langevin in layers. The simplest idea is to use different Langevin parameters in different layers (or alternatively, Langevin dynamics with an anisotropic diagonal friction matrix). Since temperature is purely formal in machine learning, we can, without concern for physical meaning, introduce an artificial temperature gradient by using different temperatures τ₁, τ₂ in the different layers. Meanwhile, we keep the learning rate fixed at the same value across all layers and throughout training. The equations then become

dθ⁽ⁱ⁾ = p⁽ⁱ⁾ dt,
dp⁽ⁱ⁾ = G⁽ⁱ⁾(θ) dt − γᵢ p⁽ⁱ⁾ dt + √(2γᵢτᵢ) dW⁽ⁱ⁾,

where the index i = 1, 2 labels the layers. Each subsystem can be propagated using BAOAB or some other Langevin integrator.
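As an illustration, a per-layer BAOAB sweep with layer-dependent (γᵢ, τᵢ) might look as follows. This is a sketch on a separable toy quadratic loss, not the paper's released code; in a real network the per-layer gradients couple through the full parameter vector.

```python
import numpy as np

def baoab_step(theta, p, G, h, gamma, tau, rng):
    """One BAOAB step of Langevin dynamics for a single parameter block."""
    p = p + 0.5 * h * G(theta)                                               # B
    theta = theta + 0.5 * h * p                                              # A
    a = np.exp(-gamma * h)
    p = a * p + np.sqrt(tau * (1.0 - a * a)) * rng.standard_normal(p.shape)  # O
    theta = theta + 0.5 * h * p                                              # A
    p = p + 0.5 * h * G(theta)                                               # B
    return theta, p

# Langevin in layers: one stepsize h everywhere, but per-layer (gamma_i, tau_i).
rng = np.random.default_rng(3)
G = lambda th: -th                   # block gradient of a separable quadratic loss
layers = {
    "hidden": {"theta": np.ones(6), "p": np.zeros(6), "gamma": 1.0,  "tau": 1e-3},
    "output": {"theta": np.ones(2), "p": np.zeros(2), "gamma": 10.0, "tau": 1e-5},
}
for _ in range(500):
    for blk in layers.values():
        blk["theta"], blk["p"] = baoab_step(blk["theta"], blk["p"], G, 0.1,
                                            blk["gamma"], blk["tau"], rng)
assert all(np.abs(b["theta"]).max() < 0.5 for b in layers.values())
```

Both blocks relax toward the loss minimum, with the low-temperature output block fluctuating far less than the hidden block.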

3.2. Langevin-Overdamped Langevin (LOL). Consider a partitioned two-part model on which we use BAOAB. Taking the friction to infinity in the output layer, namely taking the limit γ₂ → ∞, results in an alternative method with a strong stabilizing property: the momenta p⁽²⁾ are fully redrawn from their equilibrium distribution at temperature τ₂ in each O-step, so that the second partition effectively follows a discretized overdamped Langevin dynamics. We refer to this method as Langevin-Overdamped Langevin, or LOL. The motivation for this scheme is that it offers the possibility of increased exploration of hidden layer structure (including the weights and biases defining the dependence on the input), while incorporating a strong dissipation in the connection to the output layer (which provides the stabilizing property). In most of our experiments we also set the temperature of the secondary partition to zero, i.e., we set τ₂ = 0. In this scenario one may also interpret the combined method as a sort of free-energy minimization of the output layer weights and biases.
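A sketch of one LOL sweep on a separable toy loss: the hidden block takes a BAOAB step with finite γ₁, while the output block takes the γ₂ → ∞ limit of BAOAB, in which the O-step fully redraws the momentum. With τ₂ = 0 this reduces to a dissipated gradient iteration. The variable names and the toy loss are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
G = lambda th: -th                   # block gradient of a separable toy loss
th1, p1 = np.ones(6), np.zeros(6)    # hidden-layer block: Langevin (BAOAB)
th2, p2 = np.ones(2), np.zeros(2)    # output-layer block: overdamped limit
h, gamma1, tau1, tau2 = 0.1, 0.01, 1e-3, 0.0
for _ in range(2000):
    # hidden layer: B A O A B with finite friction gamma1
    p1 = p1 + 0.5 * h * G(th1)
    th1 = th1 + 0.5 * h * p1
    a = np.exp(-gamma1 * h)
    p1 = a * p1 + np.sqrt(tau1 * (1.0 - a * a)) * rng.standard_normal(6)
    th1 = th1 + 0.5 * h * p1
    p1 = p1 + 0.5 * h * G(th1)
    # output layer: gamma2 -> infinity, so the O-step fully redraws the momentum
    p2 = p2 + 0.5 * h * G(th2)
    th2 = th2 + 0.5 * h * p2
    p2 = np.sqrt(tau2) * rng.standard_normal(2)   # with tau2 = 0 this zeroes p2
    th2 = th2 + 0.5 * h * p2
    p2 = p2 + 0.5 * h * G(th2)
assert np.isfinite(th1).all()
assert np.abs(th2).max() < 1e-3   # strong dissipation drives th2 to the minimum
```

The weakly damped hidden block keeps exploring, while the overdamped output block is pinned near the minimum, which is the stabilizing behaviour described above.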

3.3. Adaptive Langevin and Langevin in layers (AdLaLa).
As mentioned in subsection 2.5, the Adaptive Langevin method (AdL) has the property that it can automatically maintain a target temperature in a system driven by Gaussian noise. While the gradient noise encountered in stochastic approximation is not, by any means, Gaussian, it may have an important Gaussian component that can be controlled using this device (as observed in practice, see [6, 30]). We therefore consider a modification of the Langevin-in-layers scheme in which the Adaptive Langevin method is used to manage the sampling of part of the system, thus extracting accumulated heat due to gradient noise.
Applying Adaptive Langevin (AdL) in layers leads, for i = 1, …, d, to the system

dθ⁽ⁱ⁾ = p⁽ⁱ⁾ dt,
dp⁽ⁱ⁾ = G⁽ⁱ⁾(θ) dt − εᵢξᵢ p⁽ⁱ⁾ dt + σ_{A,i} dW⁽ⁱ⁾,
dξᵢ = εᵢ [p⁽ⁱ⁾ᵀp⁽ⁱ⁾ − Nᵢτᵢ] dt,

where Nᵢ denotes the number of parameters in layer i. The parameters for layer i are the coupling coefficient εᵢ, the temperature parameter τᵢ, and the applied noise amplitude σ_{A,i}. Discretization then proceeds as for AdL, using either of the two variants mentioned in subsection 2.5 or some other scheme.
It is also possible to have a partitioned algorithm with some components treated using Adaptive Langevin and others using a Langevin scheme. As a simple instance of such a method, consider the two-part "AdLaLa" partitioning:

dθ⁽¹⁾ = p⁽¹⁾ dt,
dp⁽¹⁾ = G⁽¹⁾(θ) dt − εξ p⁽¹⁾ dt + σ_A dW⁽¹⁾,
dξ = ε [p⁽¹⁾ᵀp⁽¹⁾ − N₁τ₁] dt,
dθ⁽²⁾ = p⁽²⁾ dt,
dp⁽²⁾ = G⁽²⁾(θ) dt − γ p⁽²⁾ dt + √(2γτ₂) dW⁽²⁾.    (15)

Again we keep the learning rate fixed at the same value across all layers and throughout training. In the extreme case where τ₂ = 0, the second part can be viewed as a dissipated gradient system, and thus we may think of this as analogous to gradient descent with momentum; the adaptive control of the first subsystem, however, may provide greater flexibility in the approach to the overall minimum.
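A sketch of one AdLaLa step for a separable toy loss, combining the Adaptive Langevin substeps on partition 1 with BAOAB on partition 2 (names are ours; the AdL substep order follows the BACDEDCAB composition described in subsection 2.5):

```python
import numpy as np

def adlala_step(th1, p1, xi, th2, p2, G, h, eps, tau1, sigma_a, gamma, tau2, rng):
    """One AdLaLa step: Adaptive Langevin (BACDEDCAB) on partition 1,
    Langevin (BAOAB) on partition 2, for a separable toy loss."""
    N1 = th1.size
    # partition 1: B A C D E D C A B
    p1 = p1 + 0.5 * h * G(th1)
    th1 = th1 + 0.5 * h * p1
    p1 = p1 * np.exp(-0.5 * h * eps * xi)                            # C
    p1 = p1 + sigma_a * np.sqrt(0.5 * h) * rng.standard_normal(N1)   # D
    xi = xi + h * eps * (p1 @ p1 - N1 * tau1)                        # E
    p1 = p1 + sigma_a * np.sqrt(0.5 * h) * rng.standard_normal(N1)   # D
    p1 = p1 * np.exp(-0.5 * h * eps * xi)                            # C
    th1 = th1 + 0.5 * h * p1
    p1 = p1 + 0.5 * h * G(th1)
    # partition 2: B A O A B
    p2 = p2 + 0.5 * h * G(th2)
    th2 = th2 + 0.5 * h * p2
    a = np.exp(-gamma * h)
    p2 = a * p2 + np.sqrt(tau2 * (1.0 - a * a)) * rng.standard_normal(th2.size)
    th2 = th2 + 0.5 * h * p2
    p2 = p2 + 0.5 * h * G(th2)
    return th1, p1, xi, th2, p2

rng = np.random.default_rng(6)
G = lambda th: -th
th1, p1, xi = np.ones(8), np.zeros(8), 0.0
th2, p2 = np.ones(2), np.zeros(2)
for _ in range(1000):
    th1, p1, xi, th2, p2 = adlala_step(th1, p1, xi, th2, p2, G,
                                       h=0.1, eps=0.1, tau1=1e-4,
                                       sigma_a=0.01, gamma=1.0, tau2=0.0, rng=rng)
assert np.isfinite(th1).all() and np.isfinite(xi)
assert np.abs(th2).max() < 1e-3   # tau2 = 0: the output partition is a dissipated gradient system
```

With τ₂ = 0 the second partition behaves as damped gradient descent, while the first partition is kept near its target temperature by the adaptive friction ξ.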
4. Model problems for classification. We examine parameterization of fully connected single hidden-layer neural networks with ReLU activation in the context of binary classification of spiral and trigonometric data, as well as the MNIST data set. We found that the results were significantly different for the different problem classes, with MNIST data showing fewer substantial differences among schemes. In order to cast some light on this issue, we use the technique of 1D linear interpolation proposed by Goodfellow et al. (2015) [13] and a surface plotting technique [19].
The spiral data sets we use in this article are generated from the formulas

x₁ = a tᵖ cos(2btᵖπ) + cN(0,1),
x₂ = a tᵖ sin(2btᵖπ) + cN(0,1),    (16)

where t is drawn repeatedly from U(0,1) to generate data points, U denoting the uniform distribution. This creates one arm of the data set, to which we assign class label 0; the other arm is obtained by shifting the argument of the trigonometric functions by π. We typically set a = 2, p = 0.5 and c = 0.02, unless otherwise indicated. Varying b directly affects the number of turns of the spiral and therefore the complexity of the problem. In the trigonometric data set, the data for class 0 lie on a noisy cosine curve (Eq. (17)); data for class 1 are generated by the same equations with cosine replaced by sine. Typical classification data are shown in Fig. 2. Depending on the choice of parameters, these can be very challenging test cases for classification, owing to the resulting structure of the loss landscape. In particular, we believe that the training algorithm encounters significant loss barriers for these types of data sets. For this reason, methods such as SGD and ADAM, which, up to gradient noise, monotonically decrease the loss, can easily become trapped in unsuitable states or be slowed down by the presence of saddle points. By contrast, MNIST data and related image classification problems may be relatively free of these issues. This is supported by results of Ballard et al. (2017) [2], who show (using molecular potential energy landscape visualization techniques) that the landscape obtained for MNIST is single-funnel-like, with only small barriers separating the different local minima from the global minimum. In contrast, they observe large barriers in the landscape for a non-linear regression problem, demonstrating that there exist fundamental differences in the structure of the loss landscape for different training problems. In Huang et al.
(2019) [18] they illustrate this by designing a problem that standard optimizers find very challenging: a binary classification problem in which the margin between two rings of data points is pinched, which forces any good minimizer to be "sharp". The small volume of the corresponding basin makes these minima less likely to be found by standard optimizers. Below we present our own approach to demonstrating the difference between the spiral and MNIST datasets, drawing on a method proposed by Goodfellow et al. (2015) [13].
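The spiral construction of Eq. (16) can be written compactly as follows (a sketch; the sine component for x₂ and the class-1 shift by π follow the description above, and the function name is ours):

```python
import numpy as np

def spiral_data(n, a=2.0, b=2.0, p=0.5, c=0.02, rng=None):
    """Two-class spiral data: class 0 from
    x1 = a t^p cos(2 b t^p pi) + c N(0,1),  x2 = a t^p sin(2 b t^p pi) + c N(0,1),
    with t ~ U(0,1); class 1 shifts the trigonometric arguments by pi."""
    if rng is None:
        rng = np.random.default_rng()
    X, y = [], []
    for label, shift in ((0, 0.0), (1, np.pi)):
        tp = rng.uniform(0.0, 1.0, n) ** p                  # t^p: radius parameter
        x1 = a * tp * np.cos(2 * b * np.pi * tp + shift) + c * rng.standard_normal(n)
        x2 = a * tp * np.sin(2 * b * np.pi * tp + shift) + c * rng.standard_normal(n)
        X.append(np.stack([x1, x2], axis=1))
        y.append(np.full(n, label))
    return np.concatenate(X), np.concatenate(y)

X, y = spiral_data(500, b=4.0, rng=np.random.default_rng(5))  # a four-turn spiral
assert X.shape == (1000, 2) and set(y) == {0, 1}
assert np.linalg.norm(X, axis=1).max() <= 2.2   # radius a*t^p is bounded by a = 2
```

The parameter b sets the number of turns (b = 4 above gives the hard four-turn problem used in the experiments), and c sets the noise level.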
1D Linear Interpolation: Denote the initial parameter configuration by θ₀ and the parameter configuration after running the optimizer by θ_f. Define

θ*(α) = (1 − α)θ₀ + αθ_f.    (18)

We graph the loss L(θ*(α)) as a function of α. At α = 1 the loss is small, while α = 0 corresponds to the random initial state.
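The interpolation diagnostic is only a few lines of code (a sketch; the toy quadratic below stands in for an actual network loss):

```python
import numpy as np

def interpolation_curve(theta0, thetaf, loss, n=51):
    """Loss along theta*(alpha) = (1 - alpha)*theta0 + alpha*thetaf, alpha in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, n)
    return alphas, np.array([loss((1 - a) * theta0 + a * thetaf) for a in alphas])

# Toy check on a quadratic loss whose minimum sits at thetaf:
loss = lambda th: 0.5 * np.sum(th**2)
theta0, thetaf = np.array([2.0, -1.0]), np.zeros(2)
alphas, L = interpolation_curve(theta0, thetaf, loss)
assert L[-1] < L[0]                      # lower loss at the trained endpoint
assert np.all(np.diff(L) <= 1e-12)       # monotone decline: no loss barrier here
```

A barrier on the real spiral problems would show up as a bump in L between α = 0 and α = 1, which this monotone toy example does not exhibit.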
For MNIST (Fig. 3) our results are similar to the findings of Goodfellow et al. (2015) [13], who observe: "We find that the objective function has a simple, approximately convex shape along this cross-section. In other words, if we knew the correct direction, a single coarse line search could do a good job of training a neural network". By contrast, in the spiral dataset (with more than one turn, i.e., b > 1 in Eq. (16)) we observe that there is typically a barrier between the loss at the initial parameter configuration θ₀ and the loss at the parameter configuration found by the optimizers (see Fig. 4). The barrier between θ₀ and the θ_f that AdLaLa finds appears to be consistently and significantly higher than that between θ₀ and the θ_f that Adam finds (θ₀ being the same for both methods). This indicates that AdLaLa finds different kinds of minima than Adam does, which generally have lower test loss. For the trigonometric dataset, the obtained curves were generally similar to those for the two-turn spiral problem, although the height of the barrier is typically lower. We emphasize that these plots do not represent the actual path that the optimizer traverses, but they do point to a significant difference in the loss landscape structure of the MNIST versus the spiral/trigonometric datasets. We elaborate on this point by constructing surface plots.
Surface plots: It is possible to visualize the saddle by exploring a two-dimensional cross-section of the loss landscape. Denote the initial parameter configuration by θ₀ and run the optimizer twice to obtain two distinct minima, θ_{f,1} and θ_{f,2}. Define

θ*(α, β) = (1 − α)θ₀ + α[(1 − β)θ_{f,1} + βθ_{f,2}],

so the loss minima are given by (α = 1, β = 0) and (α = 1, β = 1). We observe in Fig. 5 that for the MNIST data set there is a consistent monotonic decline in loss along the line from the initial parameterization to the final parameterization. However, for the 2-turn spirals we frequently observe loss landscapes with saddle points in the cross-sectional plane. This seems to indicate a fundamental difference in the nature of these problems and in the flexibility required of the optimizers that tackle them. We note that our low-dimensional intuitions often do not translate to the high-dimensional case: critical points with high error are exponentially likely to be saddle points rather than local minima, which means that saddle points are thought to be the more likely cause of a possible impediment to optimization [5].
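The two-dimensional analogue scans (α, β) over a grid. In the toy example below, a double-well loss reproduces the saddle-between-minima picture described above (the loss function and names are ours, chosen for illustration):

```python
import numpy as np

def loss_surface(theta0, thetaf1, thetaf2, loss, n=21):
    """Loss on theta*(alpha, beta) = (1-alpha)*theta0
    + alpha*((1-beta)*thetaf1 + beta*thetaf2), over an n x n grid."""
    alphas, betas = np.linspace(0, 1, n), np.linspace(0, 1, n)
    Z = np.empty((n, n))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            Z[i, j] = loss((1 - a) * theta0 + a * ((1 - b) * thetaf1 + b * thetaf2))
    return alphas, betas, Z

# A double well along the second coordinate produces a saddle between two minima:
loss = lambda th: (th[0] - 1.0)**2 + (th[1]**2 - 1.0)**2
theta0 = np.array([0.0, 0.0])
thetaf1, thetaf2 = np.array([1.0, -1.0]), np.array([1.0, 1.0])
_, _, Z = loss_surface(theta0, thetaf1, thetaf2, loss)
assert np.isclose(Z[-1, 0], 0.0) and np.isclose(Z[-1, -1], 0.0)  # minima at (1,0), (1,1)
assert Z[-1, Z.shape[1] // 2] > Z[-1, 0]                         # barrier between them
```

The barrier along the β-edge of the grid is exactly the kind of cross-sectional saddle observed for the multi-turn spiral problems.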

5. Properties of the thermodynamic parameterization methods: ergodicity, equipartition and smooth classifiers. The principles of thermodynamics and the theory of hypoelliptic diffusions underpin the stochastic integrators proposed earlier in this article. The conditions for an SDE system to be ergodic are discussed in numerous recent works. We summarize these as used in our own recent studies of the ergodic properties of Langevin and generalized Langevin equations.
Consider the Langevin system (6)-(7). The starting point for the analysis of SDEs is the Fokker-Planck equation [10]

∂ₜρ = −p · ∇_θ ρ − G(θ) · ∇_p ρ + γ∇_p · (pρ) + γτ∆_p ρ,

where ∆_p is the Laplacian in the momenta components only. Assuming G is smooth, it is possible to find conditions which ensure that the system is ergodic in a weighted L^∞ space; this is the usual approach based on Harris chains, described in detail in the excellent book of Meyn and Tweedie [33].
For Langevin dynamics, the analysis was first carried out in detail in [32]. More recently, an alternative framework has become available which is in many ways more directly suited to applications of SDEs in machine learning. This is the method described in the work of Dolbeault et al. (2009) [7], which allows the derivation of exponential convergence rates when the Fokker-Planck operator is considered in a suitable subspace of L²(µ_τ), i.e. weighted by the canonical invariant measure µ_τ. The method can be shown to give convergence estimates for underdamped Langevin dynamics.
In very recent work, the same framework was applied to the Adaptive Langevin dynamics system [47]. The power of L² estimates is that they can be used to establish a central limit theorem, which is very important in statistical applications.
Although we have not yet looked in detail at hypocoercivity for the more complicated partitioned methods discussed here, such as AdLaLa and LOL (and it is certainly beyond the scope of this paper to do so), we expect that the weighted L² approach used for AdL in [47] could be applied to these systems as well, in order to establish the ergodic property. By contrast, for any of the deterministic schemes mentioned, and for schemes relying solely on gradient noise, ergodicity is very unlikely to hold, and we are unaware of any mathematical technique that could be used for their analysis. When additive noise is combined with gradient noise, assuming sufficient boundedness, a unique invariant measure can still be shown to exist using weighted L^∞ techniques [41].
Out of the 100 runs we also compared the parameter distributions for the run with the worst test accuracy versus the run with the best test accuracy. We observe that Adam performs worse when a larger percentage of the weights and biases are zero. The same holds for LOL, although the difference in accuracy between the worst and best run is less dramatic (a 10% difference in test accuracy for LOL, 35% for Adam). For AdLaLa there is even less variation in the accuracies obtained, and the weights appear to always be approximately equally distributed around zero.
6. Numerical Studies with Thermodynamic Parameterization Methods. Tests of the various methods were conducted using three separate codes for cross-validation and verification of consistency:
• We used a custom PyTorch-based [38] system (version 1.0.0).
• We implemented the schemes in the latest version of the DLIB package [23], written in C++.
• We created a custom native C++/QT application to perform rapid visual exploration of the training algorithms.
Code that implements the algorithms described in the article is available at github.com/TiffanyVlaar/ThermodynamicParameterizationOfNNs.

6.1. Choice of hyperparameters. Despite the high accuracy and rapid convergence of our partitioned schemes (as we illustrate below), a practitioner may consider the relatively large number of hyperparameters of these methods a disadvantage. We wish to emphasize that one can use certain rules of thumb to select the values of these hyperparameters, which significantly reduces the work required in tuning. Additionally, we note that our methods appear less sensitive to the choice of initialization (see subsection 6.5) or the subsampling batch size, which can be viewed as removing another aspect of "tuning" and is thus a major practical gain compared to, e.g., SGD or Adam. We also do not change the stepsize (learning rate) throughout training and are still able to obtain strong performance with our methods.
As a rule of thumb for AdLaLa, one can typically set the temperatures of all layers to 10⁻⁴, the coupling coefficient ε ∈ [0.05, 0.1] and the additive noise σ_A ∈ [10⁻⁴, 10⁻²], and obtain good performance. In some cases there was an advantage to using a lower temperature for the output layer. The value of γ (associated to the output layer) appears to be linked to the stepsize, consistent with the discussion of subsection 2.2. We recommend γ ∈ [0.1, 10] in typical cases. In some of our tests much higher values were used with good effect, whereas smaller values typically lead to stepsize restrictions. Generally, we can use stepsizes for AdLaLa which are similar to, or even larger than, those for SGD or SGLD, but for some of the harder problems the stepsize needed to be modestly reduced. For all spiral examples using a SHLP, the following parameter choices work well for AdLaLa: h = 0.15, τ₁ = τ₂ = 10⁻⁴, γ = 0.1, σ_A = 0.01, ε = 0.1. For LOL there are fewer parameters to set; good choices appear to be γ₁ = 0.01 and τ₁ = 10⁻³ for a SHLP. In experiments with Adam we used the hyperparameters recommended in the original article [24], namely β₁ = 0.9, β₂ = 0.999, ε_Adam = 10⁻⁸, and did not change the learning rate throughout training.

6.2. Comparison of classifiers.
The enhanced performance of AdLaLa versus Adam for the difficult 4-turn spiral dataset can be seen by comparing the classifiers they each produce (see Fig. 9). We observe that the AdLaLa classifiers are far better resolved: they have lower loss and higher test accuracy and are, moreover, smoother.
6.2.1. Evolution of the weights and classifier boundaries during training. We studied the development of the classification boundary between the two spiral classes as training progresses. We observe that, no matter which optimizer is used, all of the classifiers tend to distinguish the outer part of the two spirals first, before slowly filling in the classification plane inwards. Early in training, certain weights are assigned a specific role in fixing the shape of the classification boundary in the outer part of the spirals; these weights typically keep the same role throughout training.
We also isolated the effect of single data points on the training procedure, distinguishing data points from the center of the spirals from data points in the outer part. We observed that for SGD the outer data points caused a larger change in the weight/bias values than the inner data points (at least initially; this typically changes after 2000 steps or so). For AdLaLa, however, the inner and outer data points affected the weights more or less equally from the outset.

6.3. Thermodynamic parameterization methods can have high accuracy and rapid convergence. We provide evidence that our methods LOL and AdLaLa are able to converge more rapidly to a low test-loss parameterization than standard optimizers such as SGD, SGLD or Adam, for the spiral and trigonometric datasets. Our methods also perform competitively on the MNIST dataset compared to standard optimizers, but do not significantly outperform other methods on this problem.

Figure 9. Classifiers for the spiral dataset (Eq. (16)) generated by Adam (top row) versus AdLaLa (bottom row). For Adam the stepsize used was h = 0.005, with weights initialized as Gaussian with standard deviation 0.5. For AdLaLa the parameters were ε = 0.1, τ₁ = 0.0001, σ_A = 0.01, γ₂ = 0.03, τ₂ = 0.00001, h = 0.1, with weights initialized as Gaussian with standard deviation 0.01. For both methods we used 2% subsampling per step. From left to right in each row: 20K steps (400 epochs); 40K steps (800 epochs); 60K steps (1200 epochs). For visualization the classifier was averaged over the last 10 steps of training.
In the following experiment we show the superiority of our AdLaLa method on the spiral dataset by fixing the parameters of AdLaLa but varying the parameters of the other methods (at this point we varied only the stepsize for Adam, not its default parameters; i.e., we did not change the decay rates for the moving averages of the first and second moments). We show that AdLaLa consistently outperforms the other methods in terms of convergence rate. The experiments were performed using a neural network with a single hidden layer consisting of 100 nodes, 1000 test data points, 1000 training data points and 2% subsampling. We present comparisons for the 4-turn spiral dataset (Fig. 10). We ran similar comparisons for the easier 3-turn spiral data and observed similar trends. The amount of subsampling did not seem to affect the results much.
We also show for the planar trigonometric example with a = 6, b = 1, c = 0.02 in Eq. (17) that our methods, LOL and AdLaLa, outperform Adam in terms of convergence rate (see Fig. 11). Even at its optimal stepsize for this example, h = 0.01, Adam is almost three times as slow as AdLaLa in reaching 90% test accuracy. In our tests on a harder example which exhibits more crossings of the two data classes, namely a = 10 in Eq. (17), Adam was never able to reach the accuracy that LOL and AdLaLa obtain (see Fig. 12). Its progress slows down rapidly and halts completely after 40,000 steps; after 100,000 steps its maximum test accuracy is still around 73%. SGLD is not able to compete at all.

Figure 12. Test accuracy for the harder trigonometric example (a = 10 in Eq. (17)) with a 100-node SHLP, parameterized using different optimizers. The results were averaged over 20 runs. Hyperparameter settings: for LOL: h = 0.1, γ = τ₁ = 10⁻³; for AdLaLa: h = 0.1, τ₁ = τ₂ = 10⁻⁴, γ = 5, σ_A = 0.001, ε = 0.1; for SGLD: h = 0.1, σ = 0.01.

6.4. Thermodynamic parameterization methods can reduce overfitting.
In this section we evaluate the robustness of our algorithms to overfitting. Overfitting manifests as an increase in test loss over time, as the optimizer "overfits" the provided training data and therefore generalizes less well. To emphasize the overfitting effect, we decrease the amount of training data relative to test data, using 200 training data points versus 4000 test data points. We also increase the noise level in our 2-turn spiral dataset to c = 0.1 and use a 500-node SHLP.

Figure 13. Test loss for the 2-turn spiral dataset (Eq. (16)). We used h_SGD = 0.1 and h_Adam = 0.005; for LOL: h = 0.1, γ₁ = 1, τ₁ = 10⁻⁶; for AdLaLa: h = 0.1, τ₁ = 10⁻⁴, τ₂ = 10⁻⁸, γ = 1000, σ_A = 0.01, ε = 0.1.
In Fig. 13 one observes that SGD clearly overfits, in the sense that after a certain time its test loss monotonically increases with the number of steps. In contrast, LOL with a large enough value of γ₁ can be shown not to exhibit this behaviour. The same can be said for AdLaLa, but only after a careful selection of the method's parameter values. We note that with these parameter settings LOL and AdLaLa are slower in reaching the desired test and training accuracy, but this leads to more stability later in the training process and limits the need for early-stopping techniques. We do not claim that our methods universally counter overfitting, merely that they allow more flexibility, which can lead to increased robustness to overfitting.

6.5. Thermodynamic parameterization methods are more robust than ADAM and SGD. We show that for the two-turn spiral problem, Adam and SGD have a larger variance in their test accuracies over different runs than AdLaLa or LOL. We ran each of the optimizers 100 times and plotted the variance of the obtained test accuracies (see Fig. 14). The behaviour of Adam is highly dependent on the choice of initialization, while AdLaLa is less sensitive. We illustrate this by using both the standard PyTorch initialization [14, 38] for the weights (Adam is light blue and AdLaLa is green in Fig. 14) and a Gaussian initialization (Adam is dark blue and AdLaLa is purple in Fig. 14). We also use Gaussian initialization for the other methods: SGD (red) and LOL (yellow). We observe that our methods, yellow (LOL) and green/purple (AdLaLa) in Fig. 14, have a much lower variance in their obtained test accuracies than Adam (with either initialization) and SGD.

6.6. Role of additive noise σ_A in AdLaLa. As gradient subsampling can be expected to introduce noise into the system, it is not immediately clear what the benefit of including additive noise in the AdLaLa scheme is (see Eq. (15)). However, we demonstrate in Fig. 15 that choosing an appropriate noise strength σ_A > 0 can provide faster convergence to high-quality minima.
We run experiments on classifying the four-turn spiral problem using a SHLP with 100 hidden nodes. We draw 1000 data points as training data and use 2% subsampling for computing the gradient, with the test accuracy computed from 1000 independently drawn points. The parameters in the second layer are fixed for all experiments at γ₂ = 0.03 and τ₂ = 10⁻⁸, with ε = 0.1. We examine the performance of the scheme for different values of τ₁ and σ_A by plotting the test accuracy (averaged over ten independent runs) after 50K steps with h = 0.1. The results in Fig. 15 demonstrate that there is a broad range (at least an order of magnitude) in which additive noise significantly improves the performance of the classifier. We observe that reducing the strength of the additive noise too much (choosing σ_A < 10⁻³, for example), or removing it entirely by setting σ_A = 0, gives very poor results for the overall classification, no better than random guessing. By contrast, we are able to recover near 100% accuracy for the same computational cost and with the same parameters by including additive noise of sufficient strength (for example, σ_A = 0.04).
The performance of the scheme seems relatively insensitive to the choice of the target temperature parameter τ₁, provided it is sufficiently small. At too large a temperature the system is prevented from converging to an energy minimum, leading to poor classification accuracy.

6.7. Role of Temperature in Partitioned Schemes. In subsection 2.3 we showed, in Fig. 1, that the Langevin schemes can be more accurate when used with higher temperature. We close this series of numerical experiments with a demonstration, using the 4-turn spiral data, that the LOL method is similarly more accurate at modest temperatures (i.e., there is a band of temperatures over which LOL performance improves); see Fig. 16. Here performance increased with increasing τ until τ = 0.00001, after which it began to decrease. (The method is unusable already for τ = 0.001.)

7. Conclusion and Outlook. We have presented a new approach to the parameterization of neural networks which can, in challenging data classification problems, accelerate convergence and provide improved test accuracy. The use of additive noise to supplement gradient noise was already proposed in previous works of other authors. We draw on this idea, combining it with state-of-the-art principles for sampling algorithms coming from molecular dynamics, and deploy partitioned algorithms that substantially improve on SGD and other optimization procedures. These new methods have other advantages: for one thing, they appear not to require additional regularization to obtain good performance (we did not use regularization in our experiments). Another advantage is that the stochastic methods, namely partitioned Langevin, LOL and AdLaLa, did not require complex initialization in the cases we studied. In fact, we frequently initialized them from zero initial weights and momenta, and sometimes used built-in training package procedures such as those in DLIB and PyTorch; this did not seem to significantly impact their performance.
The implementation of many of these schemes is straightforward, although obviously any major code project will require a substantial investment of time and planning if the result is to be reliable software that is scalable to large data sets and network sizes. As preliminary groundwork, we have already released a software package called TATi (Thermodynamic Analytics Toolkit), which implements Langevin dynamics methods on the TensorFlow platform. We hope to extend this software package in the near future to also implement the more complex partitioned methods discussed in this article.
In terms of future directions for research, we mention several important challenges. First, the experiments of this article have all focussed on a limited collection of toy data sets, specifically classification problems for planar data. These present some difficulty for common training methods, so they are a good first step, but it is natural to look next at state-of-the-art challenges such as arise in large-scale image classification or natural language processing. Second, the power of these methods will not be fully recognized by the field until the results are demonstrated in deep networks, which are increasingly popular in machine learning applications due to their better accuracy and generalization capabilities. We have in fact already implemented the methods for such networks, and we expect to publish a paper on this topic soon.
Finally, we highlight the improved generalization properties of the models trained using our methods, as demonstrated in our experiments. Nowhere is the problem of poor generalization more acute than in the study of streaming data, where the continual perturbation of the data leads to aging of parameter sets, which can necessitate frequent and costly reparameterization. We therefore look to this topic as a rich source of problems on which to test our methods in the future.

Figure 1. Classifiers computed using the BAOAB Langevin dynamics integrator. Visually, good classification is obtained if the contrast is high between the color of the plotted data and the color of the classifier, indicating a clear separation of the two sets of labelled data points. The same stepsize (h = 0.4) and total number of steps (N = 50,000) were used in each training run. The friction was held fixed at γ = 10. A 500-node SHLP was used with ReLU activation, sigmoidal output and a standard cross-entropy loss function. The temperatures were set to τ = 10⁻⁸ (upper left), τ = 10⁻⁷ (upper right), τ = 10⁻⁶ (lower left) and τ = 10⁻⁵ (lower right). The figures show that the classifier substantially improves as the temperature is raised. The test accuracies for each run are shown at the top of each panel. The data are given by Eq. (17) with a = 3, b = 2 and c = 0.02. We used 1000 training data points, 1000 test data points and 2% subsampling.

Figure 2. Spiral data and trigonometric data typical of those used in our classification studies.

Figure 3. Left: graph of the loss along the line (18) for the MNIST dataset. It is clear that AdLaLa and Adam converge to different minima, although we used exactly the same initialization for both methods. There is no evidence of a loss barrier, and their final test losses are similar. Right: the same construction for a simple spiral with one turn, i.e., b = 1 in Eq. (16). As for MNIST, there is no evidence of a loss barrier.

Figure 10. AdLaLa (black dotted horizontal line in both panels) consistently outperforms SGD and SGLD (left panel) and Adam (right panel) for the 4-turn spiral dataset. The different bars in the left panel indicate SGLD with different values of σ, namely σ = 0 (blue; this is standard SGD), σ = 0.005 (red), σ = 0.01 (yellow), σ = 0.05 (purple) and σ = 0.1 (green). Whereas the set of parameter values for AdLaLa is fixed, the parameters of the other methods were varied to show the general superiority of AdLaLa. The results were averaged over multiple runs and the same initial conditions were used for all runs. The parameters used for AdLaLa were h = 0.25, τ₁ = τ₂ = 10⁻⁴, γ = 0.1, σ_A = 0.01, ε = 0.05.

Figure 15. We run the AdLaLa scheme on a SHLP with 100 hidden nodes on the four-turn spiral problem. Pixels indicate the average test accuracy for the corresponding parameters, over ten independent runs, with γ₂ = 0.03, ε = 0.1, τ₂ = 10⁻⁸ and h = 0.1.

Figure 16. Comparison of classifiers for a 200-node SHLP on 4-turn spiral data generated by LOL with different temperature values. The friction was set to 1 in all experiments and 50,000 steps were performed with stepsize 0.8 (similar to large stepsizes used in SGD). Here performance increased with increasing τ until τ = 0.00001, after which it began to decrease. (The method is unusable already for τ = 0.001.)