Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width
ORAL
Abstract
We systematically analyze optimization dynamics in deep feed-forward neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales, carefully studying the effects of learning rate, depth, and width. By tracking the top eigenvalue λ_t of the Hessian of the loss, a proxy for sharpness, we find that the dynamics can exhibit four distinct regimes: (i) an early-time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late-time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram as a function of the learning rate η = c/λ_0, depth d, and width w. We identify four critical values of c: c_critical, c_loss, c_sharp, and c_max, which separate qualitatively distinct phenomena. In particular, we discover a regime c_critical < c < c_loss, which opens up with increasing d/w, in which the sharpness decreases significantly but without an initial increase in the loss, violating the simple picture of catapulting out of a local basin into a wider one by traversing up a barrier. Our results have important implications for how to scale the learning rate as the DNN depth and width are increased in order to remain in the same phase of learning.
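The two quantities the abstract relies on, the sharpness proxy λ_t and the learning-rate parameterization η = c/λ_0, can be computed along the lines of the sketch below. This is a minimal illustration, not the authors' code: the names `loss_fn(params, batch)`, `init_params`, and `c` are hypothetical, and power iteration on Hessian-vector products is one standard way to estimate the top Hessian eigenvalue (strictly, the largest-magnitude eigenvalue, which early in training is typically the top one).

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def hvp(loss_fn, params, batch, v):
    # Hessian-vector product via forward-over-reverse autodiff,
    # without materializing the full Hessian.
    grad_fn = lambda p: jax.grad(loss_fn)(p, batch)
    return jax.jvp(grad_fn, (params,), (v,))[1]

def top_hessian_eigenvalue(loss_fn, params, batch, num_iters=100, seed=0):
    # Power iteration: repeatedly apply H to a random unit vector and
    # read off the Rayleigh quotient as the eigenvalue estimate.
    flat, unravel = ravel_pytree(params)
    v = jax.random.normal(jax.random.PRNGKey(seed), flat.shape)
    v = v / jnp.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        hv = hvp(loss_fn, params, batch, unravel(v))
        hv_flat, _ = ravel_pytree(hv)
        eig = jnp.dot(v, hv_flat)  # Rayleigh quotient v^T H v
        v = hv_flat / (jnp.linalg.norm(hv_flat) + 1e-12)
    return eig

# Learning-rate parameterization from the abstract: eta = c / lambda_0,
# where lambda_0 is the sharpness at initialization and c is the
# dimensionless constant whose critical values (c_critical, c_loss,
# c_sharp, c_max) separate the phases.
# lambda_0 = top_hessian_eigenvalue(loss_fn, init_params, batch)
# eta = c / lambda_0
```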
Presenters
- Dayal Singh Kalra (University of Maryland, College Park)
Authors
- Dayal Singh Kalra (University of Maryland, College Park)
- Maissam Barkeshli (Joint Quantum Institute, NIST/University of Maryland, College Park)