Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width
ORAL
Abstract
We systematically analyze optimization dynamics in deep feed-forward neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales, carefully studying the effects of learning rate, depth, and width. By tracking the top eigenvalue λ_t of the Hessian of the loss, a proxy for sharpness, we find that the dynamics can exhibit four distinct regimes: (i) an early-time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late-time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram as a function of the learning rate η = c/λ_0, depth d, and width w. We identify four critical values of c: c_critical, c_loss, c_sharp, and c_max, which separate qualitatively distinct phenomena. In particular, we discover a regime c_critical < c < c_loss, which opens up with increasing d/w, in which the sharpness decreases significantly but without an initial increase in the loss, violating the simple picture of catapulting out of a local basin into a wider one by traversing up a barrier. Our results have important implications for how to scale the learning rate as the DNN depth and width are increased in order to remain in the same phase of learning.
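The two quantities the abstract relies on, the sharpness proxy λ_t and the learning-rate parameterization η = c/λ_0, can be computed along the lines of the sketch below. This is a minimal illustration, not the authors' code: the names `loss_fn(params, batch)`, `init_params`, and `c` are hypothetical, and power iteration on Hessian-vector products is one standard way to estimate the top Hessian eigenvalue (strictly, the largest-magnitude eigenvalue, which early in training is typically the top one).

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def hvp(loss_fn, params, batch, v):
    # Hessian-vector product via forward-over-reverse autodiff,
    # without materializing the full Hessian.
    grad_fn = lambda p: jax.grad(loss_fn)(p, batch)
    return jax.jvp(grad_fn, (params,), (v,))[1]

def top_hessian_eigenvalue(loss_fn, params, batch, num_iters=100, seed=0):
    # Power iteration: repeatedly apply H to a random unit vector and
    # read off the Rayleigh quotient as the eigenvalue estimate.
    flat, unravel = ravel_pytree(params)
    v = jax.random.normal(jax.random.PRNGKey(seed), flat.shape)
    v = v / jnp.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        hv = hvp(loss_fn, params, batch, unravel(v))
        hv_flat, _ = ravel_pytree(hv)
        eig = jnp.dot(v, hv_flat)  # Rayleigh quotient v^T H v
        v = hv_flat / (jnp.linalg.norm(hv_flat) + 1e-12)
    return eig

# Learning-rate parameterization from the abstract: eta = c / lambda_0,
# where lambda_0 is the sharpness at initialization and c is the
# dimensionless constant whose critical values (c_critical, c_loss,
# c_sharp, c_max) separate the phases.
# lambda_0 = top_hessian_eigenvalue(loss_fn, init_params, batch)
# eta = c / lambda_0
```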
Presenters
- Dayal Singh Kalra (University of Maryland, College Park)
Authors
- Dayal Singh Kalra (University of Maryland, College Park)
- Maissam Barkeshli (Joint Quantum Institute, NIST/University of Maryland, College Park)