# American Institute of Mathematical Sciences

December 2019, 1(4): 457-489. doi: 10.3934/fods.2019019

## Partitioned integrators for thermodynamic parameterization of neural networks

 School of Mathematics and Maxwell Institute for the Mathematical Sciences, University of Edinburgh, Edinburgh EH9 3FD, United Kingdom

* Corresponding author: Benedict Leimkuhler

Published December 2019

Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent, RMSProp and ADAM. These procedures tend to drive the parameters of the network toward a local minimum. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") which rely on discretized stochastic differential equations for a defined target distribution on parameter space. We show that the thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters based on natural layer structure we obtain schemes with very rapid convergence for data sets with complicated loss landscapes.

We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including a multi-layer Langevin algorithm, AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods using numerical studies and compare their performance among themselves and in relation to standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (ⅰ) faster, (ⅱ) more accurate, and (ⅲ) more robust than standard algorithms used within machine learning frameworks.
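
As a rough illustration of what a "discretized stochastic differential equation" on parameter space can look like, the following minimal sketch applies an Euler-Maruyama step to overdamped Langevin dynamics at temperature $\tau$ with stepsize $h$. It is not the partitioned schemes of the article: the function names, the one-dimensional toy loss and the use of a full (not subsampled) gradient are assumptions made for illustration only.

```python
import math
import random


def overdamped_langevin_step(theta, grad, h, tau, rng):
    """One Euler-Maruyama step of d(theta) = -grad L(theta) dt + sqrt(2 tau) dW."""
    g = grad(theta)
    noise_scale = math.sqrt(2.0 * tau * h)
    return [t - h * gi + noise_scale * rng.gauss(0.0, 1.0)
            for t, gi in zip(theta, g)]


def toy_grad(theta):
    # Hypothetical toy loss L(theta) = 0.5 * sum(theta_i ** 2), so grad L = theta.
    return list(theta)


def mean_square(n_steps=200_000, h=0.05, tau=0.5, seed=0):
    """Run a 1-D chain on the toy loss and return the time average of theta^2."""
    rng = random.Random(seed)
    theta = [1.0]
    total = 0.0
    for _ in range(n_steps):
        theta = overdamped_langevin_step(theta, toy_grad, h, tau, rng)
        total += theta[0] ** 2
    return total / n_steps


if __name__ == "__main__":
    print(mean_square())
```

For this toy loss the target density proportional to $\exp(-L/\tau)$ has variance $\tau$, while the discretized chain settles at $\tau/(1 - h/2)$, which shows the stepsize-dependent bias of the simple scheme. Setting `tau = 0` removes the noise and reduces the step to plain gradient descent.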

Citation: Benedict Leimkuhler, Charles Matthews, Tiffany Vlaar. Partitioned integrators for thermodynamic parameterization of neural networks. Foundations of Data Science, 2019, 1 (4) : 457-489. doi: 10.3934/fods.2019019
##### References:

The figure shows classifiers computed using the BAOAB Langevin dynamics integrator. Visually, good classification is obtained if the contrast is high between the color of the plotted data and the color of the classifier, indicating a clear separation of the two sets of labelled data points. The same stepsize ($h = 0.4$) and total number of steps ($N = 50{,}000$) were used in each training run, and the friction was held fixed at $\gamma = 10$. A 500-node single hidden layer perceptron (SHLP) was used with ReLU activation, sigmoidal output and a standard cross-entropy loss function. The temperatures were set to $\tau = 10^{-8}$ (upper left), $\tau = 10^{-7}$ (upper right), $\tau = 10^{-6}$ (lower left) and $\tau = 10^{-5}$ (lower right). The figures show that the classifier substantially improves as the temperature is raised. The test accuracy for each run is shown at the top of each figure. The data is given by Eq. (17) with $a = 3$, $b = 2$ and $c = 0.02$. We used 1000 training and 1000 test data points, with 2% subsampling
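
For readers unfamiliar with BAOAB, the sketch below shows one step of the standard BAOAB splitting for unit-mass Langevin dynamics: a half kick (B), a half drift (A), an exact Ornstein-Uhlenbeck update of the momenta (O), then A and B again. The harmonic toy force, the helper names and the value $\tau = 1$ are assumptions for illustration; the network loss, subsampled gradients and temperatures used in the figure are not reproduced here.

```python
import math
import random


def baoab_step(q, p, force, h, gamma, tau, rng):
    """One BAOAB step for unit-mass Langevin dynamics at temperature tau."""
    # B: half-step momentum kick using the force at the current position.
    p = [pi + 0.5 * h * fi for pi, fi in zip(p, force(q))]
    # A: half-step position drift.
    q = [qi + 0.5 * h * pi for qi, pi in zip(q, p)]
    # O: exact Ornstein-Uhlenbeck solve for the momenta over a full step.
    c = math.exp(-gamma * h)
    s = math.sqrt((1.0 - c * c) * tau)
    p = [c * pi + s * rng.gauss(0.0, 1.0) for pi in p]
    # A: second half-step position drift.
    q = [qi + 0.5 * h * pi for qi, pi in zip(q, p)]
    # B: second half-step kick using the force at the new position.
    p = [pi + 0.5 * h * fi for pi, fi in zip(p, force(q))]
    return q, p


def toy_force(q):
    # Hypothetical harmonic potential U(q) = 0.5 * q^2, so the force is -q.
    return [-qi for qi in q]


def mean_q_squared(n_steps=200_000, h=0.4, gamma=10.0, tau=1.0, seed=0):
    """Time average of q^2 along a 1-D BAOAB trajectory on the toy potential."""
    rng = random.Random(seed)
    q, p = [0.0], [0.0]
    total = 0.0
    for _ in range(n_steps):
        q, p = baoab_step(q, p, toy_force, h, gamma, tau, rng)
        total += q[0] ** 2
    return total / n_steps


if __name__ == "__main__":
    print(mean_q_squared())
```

With `gamma = 0` the O step leaves the momenta unchanged and the scheme reduces to velocity Verlet. On the harmonic toy problem the time average of $q^2$ should be close to $\tau$ even at the fairly large stepsize $h = 0.4$.
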
Spiral data and trigonometric data typical of those used in our classification studies
Left: graph of the loss along the line (18) for the MNIST dataset. It is clear that AdLaLa and Adam converge to different minima, although we used the exact same initialization for both methods. There is no evidence of a loss-barrier. Their final test loss is similar. Right: the same construct for a simple spiral with one turn, i.e., $b = 1$ in Eq. (16). As for MNIST there is no evidence of a loss-barrier
The left and right plots show two runs with the same parameters but different initializations. We train a 20-node SHLP on the two-turn spiral dataset, i.e., $b = 2$ in Eq. (16), for 20,000 steps, with 500 training and test data points and 5% subsampling. Left: AdLaLa reaches 100% train and 99% test accuracy; Adam reaches 88% train and 91% test. Right: AdLaLa reaches 100% train and 98% test accuracy; Adam reaches 96% train and 94% test
MNIST (left) vs. Spirals (2-turn) (right) on Test
Weight and bias distributions for the 2-turn spirals dataset at different times and for different methods. Parameter settings: $h_{\text{SGD}} = 0.2, h_{\text{Adam}} = 0.005$, SGLD: $h_{\text{SGLD}} = 0.1$ and $\sigma_{\text{SGLD}} = 0.01$. AdLaLa: $h_{\text{AdLaLa}} = 0.25, \sigma_A = 0.01$, $\tau_1 = \tau_2 = 10^{-4}, \epsilon = 0.1$ and $\gamma = 0.5$. Test accuracy at step 50: 0.66 (SGD), 0.65 (Adam), 0.61 (SGLD), 0.62 (AdLaLa); at step 1000: 0.66 (SGD), 0.89 (Adam), 0.68 (SGLD), 0.82 (AdLaLa); at step 10000: 0.96 (SGD), 0.99 (Adam), 0.74 (SGLD), 0.99 (AdLaLa)
Evolution of weights for the 4-turn spiral problem. Same parameter settings as in Fig. 6, but $\gamma = 0.1$ in AdLaLa. Test accuracy at step 50: 0.5 (SGD), 0.58 (Adam), 0.52 (SGLD), 0.45 (AdLaLa); at step 1000: 0.56 (SGD), 0.55 (Adam), 0.5 (SGLD), 0.62 (AdLaLa); at step 10000: 0.58 (SGD), 0.67 (Adam), 0.54 (SGLD), 0.8 (AdLaLa)
Obtained parameter distributions over 100 runs after using different optimizers for the 2-turn spiral problem for 10K steps. Parameter settings: $h_{\text{SGD}} = 0.1, h_{\text{Adam}} = 0.005, h_{\text{SGLD}} = 0.1, \sigma_{\text{SGLD}} = 0.1$, AdLaLa has $h_{\text{AdLaLa}} = 0.25, \tau_1 = \tau_2 = 10^{-4}, \sigma_A = 0.01, \epsilon = 0.1$, $\gamma = 0.5$ (left) and $\gamma = 10$ (right). Average test accuracies: SGD: 79%, Adam: 83.7%, SGLD: 78%, AdLaLa ($\gamma = 0.5$): 93.4%, AdLaLa ($\gamma = 10$): 85.5%
Comparison of classifiers for a 500-node SHLP on 4-turn spiral data (with $a = 2, b = 4, c = 0.02, p = 1$ in Eq. (16)) generated by Adam (top row) vs AdLaLa (bottom row). For Adam the stepsize used was $h = 0.005$. Adam was initialized with Gaussian weights with standard deviation 0.5. For AdLaLa the parameters were $\epsilon = 0.1$, $\tau_1 = 0.0001$, $\sigma_A = 0.01$, $\gamma_2 = 0.03$, $\tau_2 = 0.00001, h = 0.1$. Weights were initialized as Gaussian with standard deviation 0.01. For both methods we used 2% subsampling per step. From left to right in each row: 20K steps (400 epochs); 40K steps (800 epochs); 60K steps (1200 epochs). For visualization the classifier was averaged over the last 10 steps of training
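
The step and epoch counts quoted above are related through the subsampling rate: if each step uses 2% of the training data, 100 / 2 = 50 steps make up one pass over the data. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def epochs_from_steps(steps, subsample_percent):
    """Number of passes over the training set after `steps` subsampled steps."""
    return steps * subsample_percent / 100


if __name__ == "__main__":
    for steps in (20_000, 40_000, 60_000):
        # Each step sees 2% of the data, as in the caption above.
        print(steps, epochs_from_steps(steps, 2))
```
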
AdLaLa (black dotted horizontal line in both figures) consistently outperforms SGD, SGLD (left figure) and Adam (right figure) for the spiral 4-turn dataset. The different bars in the left figure indicate SGLD with different values of $\sigma$, namely $\sigma = 0$ (blue, this is standard SGD), $\sigma = 0.005$ (red), $\sigma = 0.01$ (yellow), $\sigma = 0.05$ (purple), $\sigma = 0.1$ (green). Whereas the set of parameter values for AdLaLa is fixed, the parameters of the other methods were varied to show the general superiority of AdLaLa. The results were averaged over multiple runs and the same initial conditions were used for all runs. The parameters used for AdLaLa were $h = 0.25, \tau_1 = \tau_2 = 10^{-4}, \gamma = 0.1, \sigma_A = 0.01, \epsilon = 0.05$
Test loss/accuracy obtained for planar trigonometric data (with $a = 6$ in Eq. (17)) using different optimizers and a 100-node SHLP, 1000 test data, 1000 training data and 5% subsampling. The parameters for LOL are set to $h = 0.1, \gamma_1 = 0.01, \tau_1 = 10^{-3}$. For AdLaLa we used parameters: $h = 0.2, \tau_1 = \tau_2 = 10^{-4}, \gamma = 10, \sigma_A = 0.001, \epsilon = 0.1$
Test loss/accuracy obtained for planar trigonometric data (with $a = 10$ in Eq. (17)) with a 100-node SHLP, which was parameterized using different optimizers. The results were averaged over 20 runs. Hyperparameter settings: for LOL: $h = 0.1, \gamma_1 = 0.01, \tau_1 = 10^{-3}$; for AdLaLa: $h = 0.1, \tau_1 = \tau_2 = 10^{-4}, \gamma = 5, \sigma_A = 0.001, \epsilon = 0.1$; for SGLD: $h = 0.1, \sigma = 0.01$
Obtained while training a 500-node SHLP on the 2-turn spiral (with $c = 0.1$ in Eq. (16)). We used $h_{\text{SGD}} = 0.1, h_{\text{Adam}} = 0.005$, for LOL: $h = 0.1, \gamma_1 = 1, \tau_1 = 10^{-6}$, for AdLaLa: $h = 0.1, \tau_1 = 10^{-4}, \tau_2 = 10^{-8}, \gamma = 1000, \sigma_A = 0.01, \epsilon = 0.1$
Variance (top) and mean (bottom) in test accuracies obtained over 100 runs on the two-turn spiral problem. The methods compared are: SGD (red) with $h = 0.25$; Adam (dark blue) with $h = 0.005$ and $0.01 \cdot \mathcal{N}(0,1)$ initialization for the weights; Adam (light blue) with $\mathcal{U}(-1/\sqrt{N_{in}},1/\sqrt{N_{in}})$ (standard PyTorch) initialization for the weights, where $N_{in}$ is the number of inputs to the layer; LOL (yellow) with $h = 0.25, \gamma_1 = 0.01, \tau_1 = 10^{-3}$; AdLaLa (purple) with $h = 0.25, \tau_1 = \tau_2 = 10^{-4}, \gamma = 0.5, \sigma_A = 0.01, \epsilon = 0.1$ and Gaussian initialization; and AdLaLa (green) with standard PyTorch initialization. We used a 20-node SHLP, 500 training data and 2% subsampling
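
The two weight initializations named in this caption can be sketched in plain Python as follows. This is only an illustration of the two distributions, $0.01 \cdot \mathcal{N}(0,1)$ and $\mathcal{U}(-1/\sqrt{N_{in}},1/\sqrt{N_{in}})$; the function names and the list-of-lists weight layout are assumptions, and no PyTorch call is used or implied.

```python
import math
import random


def uniform_fan_in_init(n_in, n_out, rng):
    """Weights drawn from U(-1/sqrt(n_in), 1/sqrt(n_in)), one row per output unit."""
    bound = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-bound, bound) for _ in range(n_in)]
            for _ in range(n_out)]


def scaled_gaussian_init(n_in, n_out, rng, std=0.01):
    """Weights drawn as std * N(0, 1), one row per output unit."""
    return [[std * rng.gauss(0.0, 1.0) for _ in range(n_in)]
            for _ in range(n_out)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical layer shape: 2 planar inputs feeding 20 hidden nodes.
    w_uniform = uniform_fan_in_init(2, 20, rng)
    w_gauss = scaled_gaussian_init(2, 20, rng)
    print(max(abs(w) for row in w_uniform for w in row))
    print(max(abs(w) for row in w_gauss for w in row))
```

For a layer with two inputs the uniform bound is $1/\sqrt{2} \approx 0.71$, so the uniform weights start out far larger in magnitude than the $0.01$-scaled Gaussian ones.
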
We run the AdLaLa scheme on an SHLP with 100 hidden nodes on the four turn spiral problem. Pixels indicate the average test accuracy with corresponding parameters, from ten independent runs, where $\gamma_2 = 0.03$, $\epsilon = 0.1$, $\tau_2 = 10^{-8}$, and $h = 0.1$
Comparison of classifiers for a 200-node SHLP on 4-turn spiral data generated by LOL with different temperature values. The friction was set at 1 in all experiments and 50,000 steps were performed with stepsize 0.8 (similar to large stepsizes used in SGD). Here performance increased with increasing $\tau$ until $\tau = 0.00001$ after which it began to decrease. (The method is unusable already for $\tau = 0.001$.)