# American Institute of Mathematical Sciences

February 2021, 15(1): 41-62. doi: 10.3934/ipi.2020077

## Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data

1. Department of Mathematics, University of California, Irvine, CA 92697, USA
2. Department of Mathematics and Statistics, State University of New York at Albany, Albany, NY 12222, USA
3. Department of Mathematics, University of California, Irvine, CA 92697, USA

* Corresponding author: Ziang Long

Received: November 2019 | Revised: October 2020 | Early access: December 2020 | Published: February 2021

In this paper, we study the dynamics of gradient descent in training neural networks for classification problems. Unlike existing works, we consider the linearly non-separable case, where the training data of different classes lie in orthogonal subspaces. We show that when the network has a sufficient (but not exceedingly large) number of neurons, (1) the corresponding minimization problem has a desirable landscape in which all critical points are global minima with perfect classification; and (2) gradient descent is guaranteed to converge to a global minimum. Moreover, we discover a geometric condition on the network weights under which the weight evolution transitions from a slow phase of weight-direction spreading to a fast phase of weight convergence. The geometric condition is that the convex hull of the weights, projected onto the unit sphere, contains the origin.
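The geometric condition above has a direct computational interpretation: the fast phase begins once the origin can be written as a convex combination of the normalized weight vectors $\boldsymbol{w}_j/\|\boldsymbol{w}_j\|$. The following is a minimal sketch (not taken from the paper) of how this membership test could be carried out as a linear-programming feasibility problem; the function name `origin_in_convex_hull` and the use of `scipy.optimize.linprog` are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def origin_in_convex_hull(W, atol=1e-6):
    """Test whether the origin lies in the convex hull of the rows of W
    after each row is projected onto the unit sphere.

    W    : (k, d) array whose rows are the weight vectors w_j.
    atol : tolerance for the equality constraints.

    Illustrative sketch: the condition holds iff there exist lambda_j >= 0
    with sum_j lambda_j = 1 and sum_j lambda_j * (w_j / ||w_j||) = 0,
    which is a linear-programming feasibility problem.
    """
    V = W / np.linalg.norm(W, axis=1, keepdims=True)  # rows projected onto S^{d-1}
    k, d = V.shape
    # Equality constraints: V^T lambda = 0 (d rows) and 1^T lambda = 1 (1 row).
    A_eq = np.vstack([V.T, np.ones((1, k))])
    b_eq = np.concatenate([np.zeros(d), [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * k, method="highs")
    return bool(res.success) and np.allclose(A_eq @ res.x, b_eq, atol=atol)

# Toy check: two antipodal directions in R^2 contain the origin in their hull,
# while two directions in the same open half-plane do not.
print(origin_in_convex_hull(np.array([[1.0, 0.0], [-1.0, 0.0]])))  # True
print(origin_in_convex_hull(np.array([[1.0, 0.0], [0.0, 1.0]])))   # False
```

In the paper's setting, such a test would distinguish the slow phase of weight-direction spreading (condition not yet satisfied) from the fast phase of weight convergence (condition satisfied).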

Citation: Ziang Long, Penghang Yin, Jack Xin. Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data. Inverse Problems and Imaging, 2021, 15 (1) : 41-62. doi: 10.3934/ipi.2020077
##### References:
[1] Z. Allen-Zhu, Y. Li and Z. Song, A convergence theory for deep learning via over-parameterization, preprint, arXiv:1811.03962.
[2] A. Brutzkus and A. Globerson, Globally optimal gradient descent for a ConvNet with Gaussian inputs, preprint, arXiv:1702.07966.
[3] A. Brutzkus and A. Globerson, Over-parameterization improves generalization in the XOR detection problem, preprint.
[4] A. Brutzkus, A. Globerson, E. Malach and S. Shalev-Shwartz, SGD learns over-parameterized networks that provably generalize on linearly separable data, 6th International Conference on Learning Representations, Vancouver, BC, Canada, 2018.
[5] R. T. des Combes, M. Pezeshki, S. Shabanian, A. Courville and Y. Bengio, Convergence properties of deep neural networks on separable data, 2019. Available from: https://openreview.net/forum?id=HJfQrs0qt7.
[6] S. S. Du, X. Zhai, B. Poczós and A. Singh, Gradient descent provably optimizes over-parameterized neural networks, preprint, arXiv:1810.02054.
[7] C. Ho and S. Zimmerman, On the number of regions in an $m$-dimensional space cut by $n$ hyperplanes, Austral. Math. Soc. Gaz., 33 (2006), 260-264.
[8] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput., 9 (1997), 1735-1780. doi: 10.1162/neco.1997.9.8.1735.
[9] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, 2012, 1097-1105. doi: 10.1145/3065386.
[10] LeNet-5 - A Classic CNN Architecture. Available from: https://engmrk.com/lenet-5-a-classic-cnn-architecture/.
[11] Y. Li and Y. Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, preprint, arXiv:1808.01204.
[12] S. Liang, R. Sun, Y. Li and R. Srikant, Understanding the loss surface of neural networks for binary classification, preprint, arXiv:1803.00909.
[13] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun and N. Srebro, Towards understanding the role of over-parametrization in generalization of neural networks, preprint, arXiv:1805.12076.
[14] Q. Nguyen, M. C. Mukkamala and M. Hein, On the loss landscape of a class of deep neural networks with no bad local valleys, preprint, arXiv:1809.10749.
[15] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Machine Intell., 39 (2017), 1137-1149. doi: 10.1109/TPAMI.2016.2577031.
[16] A. Rosebrock, LeNet - Convolutional neural network in Python, 2016. Available from: https://www.pyimagesearch.com/2016/08/01/lenet-convolutional-neural-network-in-python/.
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), 484-489. doi: 10.1038/nature16961.
[18] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, et al., CosFace: Large margin cosine loss for deep face recognition, preprint, arXiv:1801.09414. doi: 10.1109/CVPR.2018.00552.
[19] P. Yin, J. Xin and Y. Qi, Linear feature transform and enhancement of classification on deep neural network, J. Sci. Comput., 76 (2018), 1396-1406. doi: 10.1007/s10915-018-0666-1.


##### Figures:
- Geometric condition in Lemma 5.2 ($d = 3$)
- 2D section of $\mathbb{R}^d$ spanned by $\tilde{\boldsymbol{w}}_1$ and $\boldsymbol{n}$
- Number of iterations to convergence vs. $\theta$, the angle between subspaces $V_1$ and $V_2$
- Left: convergent iterations vs. number of neurons ($d = 2$). Right: histogram of weight norms $\max_{t}\left\|\boldsymbol{W}^{t}\right\|$ ($d = 2$ and $k = 4$)
- Dynamics of the weights $\tilde{\boldsymbol{w}}_j$ and $\boldsymbol{u}_j$
- Left: slow-to-fast transition during LeNet [16] training on the MNIST dataset. Right: 2D projections of MNIST features from a trained convolutional neural network [18]. The 10 classes are color-coded; the feature points cluster near linearly independent subspaces
- Top row: projections onto $\mathcal{S}^2$ (inside randomly selected 3D subspaces) of weight vectors in the first fully connected layer of a trained LeNet. Bottom row: projections onto $\mathcal{S}^2$ (inside randomly selected 3D subspaces) of weight vectors and their convex hull in the second fully connected layer of a trained LeNet
Iterations taken ($\text{mean}\pm\text{std}$) to convergence with random and half-space initializations:

| # of Neurons ($2k$) | Random Init. | Half-Space Init. |
| --- | --- | --- |
| 6 | 578.90 $\pm$ 205.43 | 672.41 $\pm$ 226.53 |
| 8 | 423.96 $\pm$ 190.91 | 582.16 $\pm$ 200.81 |
| 10 | 313.29 $\pm$ 178.67 | 550.19 $\pm$ 180.59 |
| 12 | 242.72 $\pm$ 178.94 | 517.26 $\pm$ 172.46 |
| 14 | 183.53 $\pm$ 108.60 | 500.42 $\pm$ 215.87 |
| 16 | 141.00 $\pm$ 80.66 | 487.42 $\pm$ 220.48 |
| 18 | 126.52 $\pm$ 62.07 | 478.25 $\pm$ 202.71 |
| 20 | 102.09 $\pm$ 32.32 | 412.46 $\pm$ 195.92 |
| 22 | 90.65 $\pm$ 28.01 | 454.08 $\pm$ 203.00 |
| 24 | 82.93 $\pm$ 26.76 | 416.82 $\pm$ 216.58 |
