# Normalization effects on shallow neural networks and related asymptotic expansions

Corresponding author: Jiahui Yu

K.S. was partially supported by the National Science Foundation (DMS-1550918) and Simons Foundation Award 672441.
We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and the number of gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalizations. We develop an asymptotic expansion for the neural network's statistical output, pointwise with respect to the scaling parameter, as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$ there is no bias-variance trade-off, in that both the bias and the variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the normalization implied by the scaling parameter approaches the mean-field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy improve monotonically as the neural network's normalization gets closer to the mean-field normalization.
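The scaling scheme described above can be sketched numerically. The following is a minimal illustration, not the paper's experimental setup: a shallow network normalized by $1/N^\gamma$, with $\gamma = 1/2$ recovering the $1/\sqrt{N}$ scaling and $\gamma = 1$ the mean-field normalization, trained by online SGD on a toy regression target. The tanh activation, the learning-rate scaling $N^{2\gamma-1}$ (chosen so the training dynamics remain nontrivial as $N$ grows), and the target function are all assumptions made for this sketch.

```python
import numpy as np

def scaled_shallow_net(x, C, W, gamma):
    """Shallow network g(x) = N^{-gamma} * sum_i C_i * sigma(W_i . x).

    gamma = 0.5 gives the 1/sqrt(N) scaling; gamma = 1.0 gives the
    mean-field 1/N normalization.
    """
    N = C.shape[0]
    return (C @ np.tanh(W @ x)) / N**gamma

def sgd_step(x, y, C, W, gamma, lr):
    """One SGD step on the squared loss (g(x) - y)^2 / 2.

    The learning rate is multiplied by N^(2*gamma - 1), a hypothetical
    choice so that parameter updates stay O(1) as N grows.
    """
    N = C.shape[0]
    hidden = np.tanh(W @ x)
    err = (C @ hidden) / N**gamma - y
    scale = lr * N**(2 * gamma - 1)
    # Gradients of the squared loss with respect to C and W.
    C -= scale * err * hidden / N**gamma
    W -= scale * err * np.outer(C * (1.0 - hidden**2), x) / N**gamma
    return err**2

rng = np.random.default_rng(0)
N, d = 200, 3                       # hidden units, input dimension
C = rng.normal(size=N)
W = rng.normal(size=(N, d))

def target(x):                      # toy regression target (illustrative)
    return np.sin(x.sum())

losses = []
for _ in range(2000):
    x = rng.normal(size=d)
    losses.append(sgd_step(x, target(x), C, W, gamma=1.0, lr=0.1))
```

Re-running the loop with `gamma=0.5` versus `gamma=1.0` lets one compare the two normalizations directly; the paper's analysis concerns the behavior of the bias and variance of the network output in this family as $N \to \infty$.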

Mathematics Subject Classification: Primary: 60F05, 68T01, 60G99.

Figure 1.  Performance of scaled neural networks on MNIST test dataset (cross entropy loss)

Figure 2.  Performance of scaled neural networks on MNIST test dataset (MSE loss)

Figure 3.  Performance of scaled convolutional neural networks on CIFAR10 test dataset (cross entropy loss)

Figure 4.  Performance of scaled neural networks on MNIST training dataset (MSE loss)
