Gradient Descent Happens in a Tiny Subspace
arXiv (Cornell University)2018
Citations Over Time
Abstract
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
Related Papers
- → Higher-accuracy schemes for approximating the Hessian from electronic structure calculations in chemical dynamics simulations(2010)39 cited
- → Structure and Efficient Hessian Calculation(1998)11 cited
- → An Evaluation of Parallel Numerical Hessian Calculations(2010)2 cited
- → Second-order adjoint sensitivities(2017)
- → On the connection between WRI and FWI: Analysis of the nonlinear term in the Hessian matrix(2022)