We study the effect of normalization on the layers of feed-forward deep neural networks. A given layer $ i $ with $ N_{i} $ hidden units may be normalized by $ 1/N_{i}^{\gamma_{i}} $ with $ \gamma_{i}\in[1/2,1] $, and we study how the choice of the $ \gamma_{i} $ affects the statistical behavior of the neural network's output (such as its variance) as well as the test accuracy on the MNIST data set. We find that, in terms of both the variance of the neural network's output and the test accuracy, the best choice is to set the $ \gamma_{i} $ equal to one, which is the mean-field scaling. This holds particularly for the outer layer: the neural network's behavior is more sensitive to the scaling of the outer layer than to the scaling of the inner layers. The mathematical analysis is based on an asymptotic expansion of the neural network's output. An important practical consequence of the analysis is that it provides a systematic and mathematically informed way to choose the learning rate hyperparameters. Such a choice guarantees that the neural network behaves in a statistically robust way as the $ N_i $ grow to infinity.
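To make the role of the exponents $ \gamma_{i} $ concrete, here is a minimal NumPy sketch of a network with two hidden layers, each normalized by $ 1/N_{i}^{\gamma_{i}} $. The abstract does not specify the architecture, so everything beyond the scaling itself is an assumption made for illustration: the tanh activation, the standard normal hidden weights, the uniform output weights, the fixed input, and the function names. The sketch only measures the variance of the output at random initialization, with no training, so it illustrates the scaling and does not reproduce the paper's results.

```python
import numpy as np


def scaled_network_output(x, n1, n2, gamma1, gamma2, rng):
    """Output at random initialization of a network with two hidden layers,
    where layer i is normalized by 1 / N_i**gamma_i.

    Activation and weight distributions are illustrative assumptions.
    """
    w1 = rng.standard_normal((n1, x.shape[0]))  # first hidden layer weights
    w2 = rng.standard_normal((n2, n1))          # second hidden layer weights
    c = rng.uniform(-1.0, 1.0, size=n2)         # output weights, mean zero

    h1 = np.tanh(w1 @ x)                        # first hidden layer
    z2 = (w2 @ h1) / n1 ** gamma1               # inner normalization
    h2 = np.tanh(z2)                            # second hidden layer
    return float(c @ h2) / n2 ** gamma2         # outer normalization


def output_variance(gamma1, gamma2, n=100, trials=200, seed=0):
    """Empirical variance of the output over independent random initializations."""
    rng = np.random.default_rng(seed)
    x = np.full(4, 0.5)                         # fixed hypothetical input
    outputs = [
        scaled_network_output(x, n, n, gamma1, gamma2, rng)
        for _ in range(trials)
    ]
    return float(np.var(outputs))


if __name__ == "__main__":
    for gamma2 in (0.5, 0.75, 1.0):
        print(gamma2, output_variance(gamma1=0.5, gamma2=gamma2))
```

Under these assumptions the output weights are independent with mean zero, so the variance at initialization scales like $ N_{2}^{1-2\gamma_{2}} $. It stays of order one for $ \gamma_{2}=1/2 $ and shrinks like $ 1/N_{2} $ under the mean-field scaling $ \gamma_{2}=1 $.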
Figures: Performance of scaled neural networks on the MNIST test dataset (cross entropy loss).