
Partitioned integrators for thermodynamic parameterization of neural networks
School of Mathematics and Maxwell Institute for the Mathematical Sciences, University of Edinburgh, Edinburgh EH9 3FD, United Kingdom
Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent, RMSProp and Adam, which tend to drive the parameters of the network toward a local minimum of the loss. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") that rely on discretized stochastic differential equations with a prescribed target distribution on parameter space. We show that the thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters according to the natural layer structure of the network, we obtain schemes with very rapid convergence on data sets with complicated loss landscapes.
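As a rough illustration of the sampling viewpoint (a minimal sketch, not one of the article's schemes), the simplest such discretized stochastic differential equation is an overdamped Langevin / stochastic gradient Langevin update, which perturbs a gradient step with noise so that the parameters approximately sample a Gibbs distribution proportional to exp(-L(θ)/T) rather than collapsing into a single minimizer. The function name, step size h and temperature T below are illustrative assumptions.

```python
# Minimal sketch (not the article's schemes): one overdamped Langevin / SGLD step,
#   theta <- theta - h * grad L(theta) + sqrt(2 h T) * xi,  xi ~ N(0, I),
# which approximately samples exp(-L(theta)/T) on parameter space.
import math
import torch

def overdamped_langevin_step(params, loss_fn, h=1e-3, temperature=1e-4):
    """Apply one discretized overdamped Langevin update to each parameter tensor."""
    loss = loss_fn()                           # minibatch loss built from `params`
    grads = torch.autograd.grad(loss, params)  # stochastic gradient estimate
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p)
            p.add_(-h * g + math.sqrt(2.0 * h * temperature) * noise)
    return float(loss)
```

As the temperature is taken to zero the noise term vanishes and the update reduces to plain (stochastic) gradient descent, which is the sense in which the sampling view contains the optimization view as a limit.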
We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, adapted to feed-forward neural networks. These include a multilayer Langevin algorithm, AdLaLa (combining adaptive Langevin and Langevin dynamics) and LOL (combining Langevin and overdamped Langevin dynamics). We examine the convergence of these methods in numerical studies and compare their performance with one another and with standard alternatives such as stochastic gradient descent and Adam. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms used within machine learning frameworks.
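To make the partitioning idea concrete (a sketch under stated assumptions, not the AdLaLa or LOL discretizations studied in the article), one can drive different layer groups with different dynamics, e.g. momenta-based Langevin dynamics on the hidden-layer weights and overdamped Langevin dynamics on the output layer. All names, step sizes, friction and temperature values below are illustrative placeholders.

```python
# Illustrative layer-partitioned step in the spirit of a Langevin / overdamped-
# Langevin split: underdamped Langevin (with momenta) on one parameter group and
# overdamped Langevin on the other. Hyperparameters and the exact discretization
# are simplified placeholders, not the article's settings.
import math
import torch

def partitioned_langevin_step(hidden_params, output_params, momenta, loss_fn,
                              h=1e-3, gamma=1.0, T_hidden=1e-3, T_output=1e-5):
    params = list(hidden_params) + list(output_params)
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    g_hidden, g_output = grads[:len(hidden_params)], grads[len(hidden_params):]
    with torch.no_grad():
        c = math.exp(-gamma * h)  # Ornstein-Uhlenbeck damping factor for the momenta
        # Underdamped Langevin on the hidden-layer group: kick, damp/refresh, drift.
        for p, m, g in zip(hidden_params, momenta, g_hidden):
            m.add_(-h * g)
            m.mul_(c).add_(math.sqrt((1.0 - c * c) * T_hidden) * torch.randn_like(m))
            p.add_(h * m)
        # Overdamped Langevin on the output-layer group.
        for p, g in zip(output_params, g_output):
            p.add_(-h * g + math.sqrt(2.0 * h * T_output) * torch.randn_like(p))
    return float(loss)
```

Running the two groups at different temperatures is one way the layer structure can be exploited: a higher temperature on the early layers keeps them exploring, while a near-zero temperature on the output layer behaves almost like gradient descent.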