
February 2021, 15(1): 41-62. doi: 10.3934/ipi.2020077

## Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data

1. Department of Mathematics, University of California, Irvine, CA 92697, USA
2. Department of Mathematics and Statistics, State University of New York at Albany, Albany, NY 12222, USA
3. Department of Mathematics, University of California, Irvine, CA 92697, USA

* Corresponding author: Ziang Long

Received: November 2019. Revised: October 2020. Published: December 2020.

In this paper, we study the dynamics of gradient descent in learning neural networks for classification problems. Unlike existing works, we consider the linearly non-separable case, where the training data of different classes lie in orthogonal subspaces. We show that when the network has a sufficient (but not exceedingly large) number of neurons, (1) the corresponding minimization problem has a desirable landscape in which all critical points are global minima with perfect classification, and (2) gradient descent is guaranteed to converge to a global minimum. Moreover, we discover a geometric condition on the network weights such that, once it is satisfied, the weight evolution transitions from a slow phase of weight-direction spreading to a fast phase of weight convergence. The geometric condition is that the convex hull of the weights, projected onto the unit sphere, contains the origin.
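As an illustration of the geometric condition (not part of the paper's text), one standard way to test numerically whether the origin lies in the convex hull of the projected weights $\tilde{\boldsymbol{w}}_1, \dots, \tilde{\boldsymbol{w}}_k$ is to solve a small linear-programming feasibility problem: the origin is in the hull iff there exist $\lambda_j \ge 0$ with $\sum_j \lambda_j = 1$ and $\sum_j \lambda_j \tilde{\boldsymbol{w}}_j = 0$. The sketch below assumes a Python/SciPy environment; the function name `origin_in_convex_hull` and the random weight matrix are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def origin_in_convex_hull(points):
    """Return True if the origin lies in the convex hull of the rows of `points`.

    Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and
    points.T @ lambda = 0. Such a lambda exists iff 0 is in the hull.
    """
    k, d = points.shape
    # Stack the two equality systems: points.T @ lam = 0 and 1^T lam = 1.
    A_eq = np.vstack([points.T, np.ones((1, k))])
    b_eq = np.concatenate([np.zeros(d), [1.0]])
    res = linprog(c=np.zeros(k),            # zero objective: pure feasibility check
                  A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k,   # lambda >= 0
                  method="highs")
    return res.status == 0                  # status 0 = solved, i.e., LP is feasible

# Hypothetical example: 2k = 8 neurons in R^3, weights drawn at random,
# then projected onto the unit sphere as in the paper's condition.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))
W_tilde = W / np.linalg.norm(W, axis=1, keepdims=True)
print(origin_in_convex_hull(W_tilde))
```

Under the paper's characterization, monitoring this Boolean during training would flag when the condition for the slow-to-fast transition of the weight evolution holds.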

Citation: Ziang Long, Penghang Yin, Jack Xin. Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data. Inverse Problems & Imaging, 2021, 15(1): 41-62. doi: 10.3934/ipi.2020077.
##### References:
[1] Z. Allen-Zhu, Y. Li and Z. Song, A convergence theory for deep learning via over-parameterization, preprint, arXiv:1811.03962.
[2] A. Brutzkus and A. Globerson, Globally optimal gradient descent for a ConvNet with Gaussian inputs, preprint, arXiv:1702.07966.
[3] A. Brutzkus and A. Globerson, Over-parameterization improves generalization in the XOR detection problem, preprint.
[4] A. Brutzkus, A. Globerson, E. Malach and S. Shalev-Shwartz, SGD learns over-parameterized networks that provably generalize on linearly separable data, 6th International Conference on Learning Representations, Vancouver, BC, Canada, 2018.
[5] R. T. des Combes, M. Pezeshki, S. Shabanian, A. Courville and Y. Bengio, Convergence properties of deep neural networks on separable data, 2019. Available from: https://openreview.net/forum?id=HJfQrs0qt7.
[6] S. S. Du, X. Zhai, B. Poczós and A. Singh, Gradient descent provably optimizes over-parameterized neural networks, preprint, arXiv:1810.02054.
[7] C. Ho and S. Zimmerman, On the number of regions in an $m$-dimensional space cut by $n$ hyperplanes, Austral. Math. Soc. Gaz., 33 (2006), 260-264.
[8] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput., 9 (1997), 1735-1780. doi: 10.1162/neco.1997.9.8.1735.
[9] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, 2012, 1097-1105. doi: 10.1145/3065386.
[10] LeNet-5 - A Classic CNN Architecture. Available from: https://engmrk.com/lenet-5-a-classic-cnn-architecture/.
[11] Y. Li and Y. Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, preprint, arXiv:1808.01204.
[12] S. Liang, R. Sun, Y. Li and R. Srikant, Understanding the loss surface of neural networks for binary classification, preprint, arXiv:1803.00909.
[13] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun and N. Srebro, Towards understanding the role of over-parametrization in generalization of neural networks, preprint, arXiv:1805.12076.
[14] Q. Nguyen, M. C. Mukkamala and M. Hein, On the loss landscape of a class of deep neural networks with no bad local valleys, preprint, arXiv:1809.10749.
[15] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Machine Intell., 39 (2017), 1137-1149. doi: 10.1109/TPAMI.2016.2577031.
[16] A. Rosebrock, LeNet - Convolutional neural network in Python, 2016. Available from: https://www.pyimagesearch.com/2016/08/01/lenet-convolutional-neural-network-in-python/.
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), 484-489. doi: 10.1038/nature16961.
[18] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong et al., CosFace: Large margin cosine loss for deep face recognition, preprint, arXiv:1801.09414. doi: 10.1109/CVPR.2018.00552.
[19] P. Yin, J. Xin and Y. Qi, Linear feature transform and enhancement of classification on deep neural network, J. Sci. Comput., 76 (2018), 1396-1406. doi: 10.1007/s10915-018-0666-1.

##### Figure captions:

- Geometric condition in Lemma 5.2 ($d = 3$).
- 2-dimensional section of $\mathbb{R}^d$ spanned by $\tilde{\boldsymbol{w}}_1$ and $\boldsymbol{n}$.
- Number of iterations to convergence vs. $\theta$, the angle between the subspaces $V_1$ and $V_2$.
- Left: iterations to convergence vs. number of neurons ($d = 2$). Right: histogram of the maximal weight norm $\max_{t}\left\|\boldsymbol{W}^{t}\right\|$ ($d = 2$, $k = 4$).
- Dynamics of the weights $\tilde{\boldsymbol{w}}_j$ and $\boldsymbol{u}_j$.
- Left: slow-to-fast transition during LeNet [16] training on the MNIST dataset. Right: 2D projections of MNIST features from a trained convolutional neural network [18]; the 10 classes are color coded, and the feature points cluster near linearly independent subspaces.
- Top row: projections onto $\mathcal{S}^2$ (inside randomly selected 3D subspaces) of the weight vectors in the first fully connected layer of a trained LeNet. Bottom row: the same projections, together with their convex hull, for the second fully connected layer.
Table: Iterations taken ($\text{mean} \pm \text{std}$) to convergence with random and half-space initializations.

| # of Neurons ($2k$) | Random Init. | Half Space Init. |
|:---:|:---:|:---:|
| 6  | 578.90 $\pm$ 205.43 | 672.41 $\pm$ 226.53 |
| 8  | 423.96 $\pm$ 190.91 | 582.16 $\pm$ 200.81 |
| 10 | 313.29 $\pm$ 178.67 | 550.19 $\pm$ 180.59 |
| 12 | 242.72 $\pm$ 178.94 | 517.26 $\pm$ 172.46 |
| 14 | 183.53 $\pm$ 108.60 | 500.42 $\pm$ 215.87 |
| 16 | 141.00 $\pm$ 80.66  | 487.42 $\pm$ 220.48 |
| 18 | 126.52 $\pm$ 62.07  | 478.25 $\pm$ 202.71 |
| 20 | 102.09 $\pm$ 32.32  | 412.46 $\pm$ 195.92 |
| 22 | 90.65 $\pm$ 28.01   | 454.08 $\pm$ 203.00 |
| 24 | 82.93 $\pm$ 26.76   | 416.82 $\pm$ 216.58 |
