December 2019, 1(4): 457-489. doi: 10.3934/fods.2019019

Partitioned integrators for thermodynamic parameterization of neural networks

Benedict Leimkuhler, Charles Matthews and Tiffany Vlaar

School of Mathematics and Maxwell Institute for the Mathematical Sciences, University of Edinburgh, Edinburgh EH9 3FD, United Kingdom

* Corresponding author: Benedict Leimkuhler

Published December 2019

Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent (SGD), RMSprop and Adam, which tend to drive the parameters of the network toward a local minimum of the loss. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") which rely on discretized stochastic differential equations with a prescribed target distribution on parameter space. We show that this thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters according to the natural layer structure, we obtain schemes with very rapid convergence on data sets with complicated loss landscapes.
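To make the basic mechanism concrete, the sketch below applies one of the simplest such discretizations, an unadjusted (overdamped) Langevin update, to a generic loss gradient. This is a minimal illustration rather than the paper's training code: the quadratic toy loss, stepsize $h$ and temperature $\tau$ are placeholders chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def ula_step(theta, grad_loss, h, tau):
    """One unadjusted (overdamped) Langevin step:
    theta <- theta - h * grad L(theta) + sqrt(2 h tau) * N(0, I).
    The underlying continuous dynamics samples rho(theta) ~ exp(-L(theta)/tau)."""
    noise = rng.standard_normal(theta.shape)
    return theta - h * grad_loss(theta) + np.sqrt(2.0 * h * tau) * noise

# Toy usage with a quadratic stand-in loss L(theta) = 0.5 * ||theta||^2,
# so the target distribution is Gaussian with variance tau per coordinate.
theta = np.ones(3)
for _ in range(20_000):
    theta = ula_step(theta, lambda th: th, h=0.01, tau=1e-4)
```

For small $h$ the iterates resemble noisy gradient descent whose long-run statistics approximate the Gibbs distribution $\propto \exp(-L(\theta)/\tau)$; decreasing $\tau$ concentrates the sampled parameters near low-loss regions.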

We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including a multi-layer Langevin algorithm, AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and overdamped Langevin). We examine the convergence of these methods in numerical studies and compare their performance with one another and with standard alternatives such as stochastic gradient descent and Adam. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms used within machine learning frameworks.
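A minimal sketch of the layer-partitioning idea, in the spirit of LOL, is given below: an underdamped Langevin update (here a simple Euler-Maruyama discretization, not the integrators developed in the paper) is applied to the hidden-layer parameters, while an overdamped Langevin update is applied to the output layer. Which dynamics is attached to which layer, the hyperparameter values and the stand-in gradients are all illustrative assumptions; AdLaLa would additionally replace the fixed friction by an adaptively controlled variable, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)

def langevin_em(q, p, grad, h, gamma, tau):
    """Euler-Maruyama step of underdamped Langevin dynamics (unit mass)."""
    p = (p - h * grad(q) - h * gamma * p
         + np.sqrt(2.0 * gamma * tau * h) * rng.standard_normal(p.shape))
    q = q + h * p
    return q, p

def overdamped_em(q, grad, h, tau):
    """Euler-Maruyama step of overdamped Langevin dynamics."""
    return q - h * grad(q) + np.sqrt(2.0 * h * tau) * rng.standard_normal(q.shape)

# Layer partition with per-layer hyperparameters (illustrative values only).
params = {"hidden": np.zeros(40), "output": np.zeros(5)}
moms = {"hidden": np.zeros(40)}
grads = {name: (lambda q: q) for name in params}  # stand-in for layer-wise loss gradients

for step in range(1000):
    params["hidden"], moms["hidden"] = langevin_em(
        params["hidden"], moms["hidden"], grads["hidden"], h=0.1, gamma=1.0, tau=1e-4)
    params["output"] = overdamped_em(params["output"], grads["output"], h=0.1, tau=1e-6)
```

The per-layer dictionaries stand in for the weight and bias groups of a feed-forward network; in practice the gradients would come from backpropagation on a subsampled batch.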

Citation: Benedict Leimkuhler, Charles Matthews, Tiffany Vlaar. Partitioned integrators for thermodynamic parameterization of neural networks. Foundations of Data Science, 2019, 1 (4) : 457-489. doi: 10.3934/fods.2019019

Figure 1.  Classifiers computed using the BAOAB Langevin dynamics integrator. Visually, good classification is obtained when there is high contrast between the color of the plotted data and the color of the classifier, indicating a clear separation of the two sets of labelled data points. The same stepsize ($ h = 0.4 $) and total number of steps ($ N = 50,000 $) were used in each training run, and the friction was held fixed at $ \gamma = 10 $. A 500-node SHLP (single hidden layer perceptron) was used with ReLU activation, sigmoidal output and a standard cross-entropy loss function. The temperatures were set to $ \tau = 10^{-8} $ (upper left), $ \tau = 10^{-7} $ (upper right), $ \tau = 10^{-6} $ (lower left) and $ \tau = 10^{-5} $ (lower right). The classifier substantially improves as the temperature is raised. The test accuracy for each run is shown at the top of the corresponding panel. The data are given by Eq. (17) with a = 3, b = 2 and c = 0.02. We used 1000 training and 1000 test data points and 2% subsampling
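For reference, a minimal sketch of one BAOAB step for unit-mass Langevin dynamics is given below. It is a toy under stated assumptions, not the training code used for the figure: the gradient is a quadratic stand-in rather than a network loss gradient, and the Figure 1 settings ($ h = 0.4 $, $ \gamma = 10 $, a mid-range $ \tau $) are reused only as placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def baoab_step(q, p, grad_loss, h, gamma, tau):
    """One BAOAB step for unit-mass Langevin dynamics targeting exp(-L(q)/tau)."""
    p = p - 0.5 * h * grad_loss(q)            # B: half kick
    q = q + 0.5 * h * p                       # A: half drift
    c = np.exp(-gamma * h)                    # O: exact Ornstein-Uhlenbeck update
    p = c * p + np.sqrt(tau * (1.0 - c * c)) * rng.standard_normal(p.shape)
    q = q + 0.5 * h * p                       # A: half drift
    p = p - 0.5 * h * grad_loss(q)            # B: half kick
    return q, p

# Placeholder run with a quadratic stand-in gradient instead of the network loss.
q, p = np.ones(10), np.zeros(10)
for _ in range(50_000):
    q, p = baoab_step(q, p, lambda x: x, h=0.4, gamma=10.0, tau=1e-6)
```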
Figure 2.  Spiral data and trigonometric data typical of those used in our classification studies
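Equations (16) and (17) defining the spiral and trigonometric data are not reproduced in this excerpt. The sketch below generates generic two-class spiral data of the kind pictured in Figure 2, purely for illustration; the arguments n_per_class, turns and noise are hypothetical stand-ins, only loosely analogous to the paper's parameters a, b, c, p.

```python
import numpy as np

def make_spirals(n_per_class=500, turns=2, noise=0.02, seed=0):
    """Two-class spiral data with the given number of turns and Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=n_per_class)
    angle = 2.0 * np.pi * turns * t
    arm = np.stack([t * np.cos(angle), t * np.sin(angle)], axis=1)
    X = np.concatenate([arm, -arm])            # second class: the arm rotated by pi
    X += noise * rng.standard_normal(X.shape)
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X, y = make_spirals(n_per_class=500, turns=2, noise=0.02)   # cf. the 2-turn spiral runs
```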
Figure 3.  Left: graph of the loss along the line (18) for the MNIST dataset. AdLaLa and Adam clearly converge to different minima, even though the exact same initialization was used for both methods; there is no evidence of a loss barrier, and their final test losses are similar. Right: the same construction for a simple spiral with one turn, i.e., $ b = 1 $ in Eq. (16). As for MNIST, there is no evidence of a loss barrier
Figure 4.  The left and right plots show two runs with the same parameters but different initializations. We train a 20-node SHLP on the two-turn spiral dataset, i.e., $ b = 2 $ in Eq. (16), for 20,000 steps, with 500 training and test data points and 5% subsampling. Left: the parameterization found by AdLaLa gives 100% train and 99% test accuracy, while Adam reaches 88% train and 91% test. Right: AdLaLa reaches 100% train and 98% test; Adam reaches 96% train and 94% test
Figure 5.  MNIST (left) vs. 2-turn spirals (right), evaluated on the test set
Figure 6.  Weight and bias distributions for the 2-turn spirals dataset at different times and for different methods. Parameter settings: $ h_{\text{SGD}} = 0.2, h_{\text{Adam}} = 0.005 $, SGLD: $ h_{\text{SGLD}} = 0.1 $ and $ \sigma_{\text{SGLD}} = 0.01 $. AdLaLa: $ h_{\text{AdLaLa}} = 0.25, \sigma_A = 0.01 $, $ \tau_1 = \tau_2 = 10^{-4}, \epsilon = 0.1 $ and $ \gamma = 0.5 $. Test accuracy at step 50: 0.66 (SGD), 0.65 (Adam), 0.61 (SGLD), 0.62 (AdLaLa); at step 1000: 0.66 (SGD), 0.89 (Adam), 0.68 (SGLD), 0.82 (AdLaLa); at step 10000: 0.96 (SGD), 0.99 (Adam), 0.74 (SGLD), 0.99 (AdLaLa)
Figure 7.  Evolution of weights for the 4-turn spiral problem. Same parameter settings as in Fig. 6, but $ \gamma = 0.1 $ in AdLaLa. Test accuracy at step 50: 0.5 (SGD), 0.58 (Adam), 0.52 (SGLD), 0.45 (AdLaLa); at step 1000: 0.56 (SGD), 0.55 (Adam), 0.5 (SGLD), 0.62 (AdLaLa); at step 10k: 0.58 (SGD), 0.67 (Adam), 0.54 (SGLD), 0.8 (AdLaLa)
Figure 8.  Obtained parameter distributions over 100 runs after using different optimizers for the 2-turn spiral problem for 10K steps. Parameter settings: $ h_{\text{SGD}} = 0.1, h_{\text{Adam}} = 0.005, h_{\text{SGLD}} = 0.1, \sigma_{\text{SGLD}} = 0.1 $, AdLaLa has $ h_{\text{AdLaLa}} = 0.25, \tau_1 = \tau_2 = 10^{-4}, \sigma_A = 0.01, \epsilon = 0.1 $, $ \gamma = 0.5 $ (left) and $ \gamma = 10 $ (right). Average test accuracies: SGD: 79%, Adam: 83.7%, SGLD: 78%, AdLaLa ($ \gamma = 0.5 $): 93.4%, AdLaLa ($ \gamma = 10 $): 85.5%
Figure 9.  Comparison of classifiers for a 500-node SHLP on 4-turn spiral data (with $ a = 2, b = 4, c = 0.02, p = 1 $ in Eq. (16)) generated by Adam (top row) vs AdLaLa (bottom row). For Adam the stepsize used was $ h = 0.005 $. Adam was initialized with Gaussian weights with standard deviation 0.5. For AdLaLa the parameters were $ \epsilon = 0.1 $, $ \tau_1 = 0.0001 $, $ \sigma_A = 0.01 $, $ \gamma_2 = 0.03 $, $ \tau_2 = 0.00001, h = 0.1 $. Weights were initialized as Gaussian with standard deviation 0.01. For both methods we used 2% subsampling per step. From left to right in each row: 20K steps (400 epochs); 40K steps (800 epochs); 60K steps (1200 epochs). For visualization the classifier was averaged over the last 10 steps of training
Figure 10.  AdLaLa (black dotted horizontal line in both figures) consistently outperforms SGD, SGLD (left figure) and Adam (right figure) on the 4-turn spiral dataset. The bars in the left figure indicate SGLD with different values of $ \sigma $, namely $ \sigma = 0 $ (blue; this is standard SGD), $ \sigma = 0.005 $ (red), $ \sigma = 0.01 $ (yellow), $ \sigma = 0.05 $ (purple) and $ \sigma = 0.1 $ (green). Whereas the set of parameter values for AdLaLa was fixed, the parameters of the other methods were varied to demonstrate the general superiority of AdLaLa. The results were averaged over multiple runs and the same initial conditions were used for all runs. The parameters used for AdLaLa were $ h = 0.25, \tau_1 = \tau_2 = 10^{-4}, \gamma = 0.1, \sigma_A = 0.01, \epsilon = 0.05 $
Figure 11.  Test loss/accuracy obtained for planar trigonometric data (with a = 6 in Eq. (17)) using different optimizers and a 100 node SHLP, 1000 test data, 1000 training data and 5% subsampling. The parameters for LOL are set to $ h = 0.1, \gamma_1 = 0.01, \tau_1 = 10^{-3} $. For AdLaLa we used parameters: $ h = 0.2, \tau_1 = \tau_2 = 10^{-4}, \gamma = 10, \sigma_A = 0.001, \epsilon = 0.1 $
Figure 12.  Test loss/accuracy obtained for planar trigonometric data (with a = 10 in Eq. (17)) with a 100 node SHLP, which was parameterized using different optimizers. The results were averaged over 20 runs. Hyperparameters settings: for LOL: $ h = 0.1, \gamma_1 = 0.01, \tau_1 = 10^{-3} $; for AdLaLa: $ h = 0.1, \tau_1 = \tau_2 = 10^{-4}, \gamma = 5, \sigma_A = 0.001, \epsilon = 0.1 $; for SGLD: $ h = 0.1, \sigma = 0.01 $
Figure 13.  Results obtained while training a 500-node SHLP on the 2-turn spiral (with $ c = 0.1 $ in Eq. (16)). We used $ h_{\text{SGD}} = 0.1, h_{\text{Adam}} = 0.005 $; for LOL: $ h = 0.1, \gamma_1 = 1, \tau_1 = 10^{-6} $; for AdLaLa: $ h = 0.1, \tau_1 = 10^{-4}, \tau_2 = 10^{-8}, \gamma = 1000, \sigma_A = 0.01, \epsilon = 0.1 $
Figure 14.  Variance (top) and mean (bottom) in test accuracies obtained over 100 runs on the two-turn spiral problem using SGD (red) with $ h = 0.25 $, Adam (dark blue) with $ h = 0.005 $ and 0.01 $ \cdot \mathcal{N}(0,1) $ initialization for the weights, Adam (light blue) with $ \mathcal{U}(-1/\sqrt{N_{in}},1/\sqrt{N_{in}}) $ (standard PyTorch) initialization for the weights (where $ N_{in} $ is the number of inputs to the layer), LOL (yellow) with $ h = 0.25, \gamma_1 = 0.01, \tau_1 = 10^{-3} $, and AdLaLa (purple) with $ h = 0.25, \tau_1 = \tau_2 = 10^{-4}, \gamma = 0.5, \sigma_A = 0.01, \epsilon = 0.1 $ with Gaussian initialization, AdLaLa (green) with standard PyTorch initialization. We used a 20 node SHLP, 500 training data and 2% subsampling
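The two weight initializations compared in Figure 14 can be sketched roughly as follows; the layer shapes are hypothetical placeholders, and the uniform rule is the fan-in-based default described in the caption.

```python
import numpy as np

rng = np.random.default_rng(3)

def init_gaussian(shape, std=0.01):
    """Small Gaussian initialization, 0.01 * N(0, 1), as used in some Figure 14 runs."""
    return std * rng.standard_normal(shape)

def init_uniform_fan_in(shape):
    """U(-1/sqrt(N_in), 1/sqrt(N_in)) initialization, where N_in is the layer fan-in
    (the standard PyTorch default described in the caption)."""
    bound = 1.0 / np.sqrt(shape[1])
    return rng.uniform(-bound, bound, size=shape)

# Hypothetical shapes for a 20-node SHLP on 2-d spiral inputs with a scalar output.
W_hidden = init_gaussian((20, 2))
W_output = init_uniform_fan_in((1, 20))
```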
Figure 15.  We run the AdLaLa scheme on an SHLP with 100 hidden nodes on the four turn spiral problem. Pixels indicate the average test accuracy with corresponding parameters, from ten independent runs, where $ \gamma_2 = 0.03 $, $ \epsilon = 0.1 $, $ \tau_2 = 10^{-8} $, and $ h = 0.1 $
Figure 16.  Comparison of classifiers for a 200-node SHLP on 4-turn spiral data generated by LOL with different temperature values. The friction was set to 1 in all experiments and 50,000 steps were performed with stepsize 0.8 (similar to the large stepsizes used in SGD). Here performance increased with increasing $ \tau $ up to $ \tau = 0.00001 $, after which it began to decrease. (The method is already unusable for $ \tau = 0.001 $.)