How SGD noise affects performance in distinct regimes of deep learning
ORAL
Abstract
Understanding when the noise in stochastic gradient descent (SGD) improves the generalization of neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise, or 'temperature' T, affects performance as the scale of initialization α is varied. α is a key parameter that controls whether the network is 'lazy' and behaves as a kernel (α >> 1), or instead learns features (α << 1).
For classification of MNIST and CIFAR10 images by deep nets, we empirically observe that: (i) if α << 1, the optimal test error is achieved for a temperature value T_opt ~ α^k.
In the kernel regime, (ii) the relative variation of the weights at the end of training, with respect to their initialization, increases as T^δ P^γ, where P is the number of training points; (iii) the training time t*, defined as the learning rate times the number of training steps required to bring a hinge loss to zero, increases as t* ~ T P^b; (iv) at the cross-over temperature T_c ~ P^{-a}, the model escapes the kernel regime and its test error changes. We rationalize (i) with a scaling argument yielding k = (D-1)/(D+1), where D is the number of hidden layers of the network. We explain (ii, iii) using a perceptron architecture, for which we can compute the weight-dependent covariance of the SGD noise and obtain the exponents b, γ and δ; b and γ are found to depend on the density of data near the boundary separating the labels. This model demonstrates that increasing the noise magnitude T increases the training time, leading to a larger change of the weights and allowing the model to escape the kernel regime. We therefore rationalize (iv) with a scaling argument relating the exponents a, γ, δ as a = γ/δ.
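A minimal sketch of the scaling argument behind (iv), written in LaTeX and assuming, as stated above, that the kernel regime is escaped once the relative weight variation with respect to initialization becomes of order one (the symbols ΔW and W_0 below are illustrative shorthand for the weight change and the initial weights, not notation from the paper):

% Hedged sketch: combining the scaling in (ii) with an order-one escape criterion gives the exponent a.
\[
  \frac{\lVert \Delta W \rVert}{\lVert W_0 \rVert} \sim T^{\delta} P^{\gamma} \sim 1
  \quad \Longrightarrow \quad
  T_c \sim P^{-\gamma/\delta}
  \quad \Longrightarrow \quad
  a = \frac{\gamma}{\delta}.
\]

Setting the relative variation from (ii) to order one at T = T_c and comparing with T_c ~ P^{-a} yields the stated relation.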
Presenters
-
Antonio Sclocchi
EPFL, École Polytechnique Fédérale de Lausanne
Authors
-
Antonio Sclocchi
EPFL, École Polytechnique Fédérale de Lausanne
-
Mario Geiger
Massachusetts Institute of Technology
-
Matthieu Wyart
École Polytechnique Fédérale de Lausanne