In this paper, we propose a geometric framework to analyze the convergence properties of gradient descent trajectories in the context of linear neural networks. We translate a well-known empirical observation about linear neural nets into a conjecture, which we call the overfitting conjecture, stating that for almost all training data and initial conditions, the trajectory of the corresponding gradient descent system converges to a global minimum. This would imply that, for linear neural networks with an arbitrary number of hidden layers, the solution found by vanilla gradient descent is equivalent to the least-squares estimate. Building on a key invariance property induced by the network structure, we first establish convergence of gradient descent trajectories to critical points of the square loss function for linear networks of arbitrary depth. Our second result is a proof of the overfitting conjecture for single-hidden-layer linear networks, with an argument based on the notion of normal hyperbolicity and under a genericity assumption on the training data (i.e., one holding for almost all training data).
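The overfitting conjecture stated above predicts that, for a single-hidden-layer linear network trained by gradient descent on the square loss, the learned end-to-end map coincides (for almost all data and initializations) with the least-squares estimator. The following NumPy sketch is not taken from the paper; it only illustrates this prediction numerically on synthetic data, with dimensions, step size, and iteration count chosen as arbitrary assumptions.

```python
# Minimal sketch (not the paper's method): gradient descent on the square loss
# of a single-hidden-layer linear network, compared against the closed-form
# least-squares estimator for the end-to-end linear map.
import numpy as np

rng = np.random.default_rng(0)

n, d_in, d_hidden, d_out = 200, 5, 8, 3          # illustrative dimensions
X = rng.standard_normal((n, d_in))               # generic training inputs
Y = X @ rng.standard_normal((d_in, d_out)) + 0.1 * rng.standard_normal((n, d_out))

# Generic (random) initialization of the two weight matrices.
W1 = 0.1 * rng.standard_normal((d_hidden, d_in))
W2 = 0.1 * rng.standard_normal((d_out, d_hidden))

lr = 1e-3                                        # illustrative step size
for _ in range(50_000):
    R = X @ W1.T @ W2.T - Y                      # residual of the network's predictions
    grad_W2 = R.T @ X @ W1.T / n                 # gradient of (1/2n)||X W1^T W2^T - Y||_F^2 w.r.t. W2
    grad_W1 = W2.T @ R.T @ X / n                 # gradient of the same loss w.r.t. W1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# Least-squares estimator of the linear map from inputs to outputs.
W_ls = np.linalg.lstsq(X, Y, rcond=None)[0].T    # shape (d_out, d_in)

# For generic data and initialization, the end-to-end product W2 @ W1 is
# expected to approach the least-squares solution, so this norm should be small.
print("||W2 W1 - W_ls||_F =", np.linalg.norm(W2 @ W1 - W_ls))
```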
[Figure: A geometric "vision" of the loss landscape]
[Figure: Visual representation of normal hyperbolicity]
[Figure: Visual representation of normal hyperbolicity and trajectory samples (in green)]