
How SGD noise affects performance in distinct regimes of deep learning

ORAL

Abstract

Understanding when the noise in stochastic gradient descent (SGD) improves the generalization of neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise, or "temperature" T, affects performance as the scale of initialization α is varied. α is a key parameter that controls whether the network is "lazy" and behaves as a kernel (α >> 1), or instead learns features (α << 1).

For classification of MNIST and CIFAR10 images by deep nets, we empirically observe that: (i) if α << 1, the optimal test error is achieved for a temperature value T_opt ~ α^k.

In the kernel regime, (ii) the relative variation of the weights at the end of training with respect to initialization increases as T^δ P^γ, where P is the number of training points; (iii) the training time t*, defined as the learning rate times the number of training steps required to bring a hinge loss to zero, increases as t* ~ T P^b; (iv) at the cross-over temperature T_c ~ P^(-a) the model escapes the kernel regime and its test error changes. We rationalize (i) with a scaling argument yielding k = (D-1)/(D+1), where D is the number of hidden layers of the network. We explain (ii, iii) using a perceptron architecture, for which we can compute the weight-dependent covariance of the SGD noise, and we obtain the exponents b, γ and δ. b and γ are found to depend on the density of data near the boundary separating labels. This model demonstrates that increasing the noise magnitude T increases the training time, leading to a larger change of the weights and allowing the model to escape the kernel regime. We therefore rationalize (iv) with a scaling argument that relates the exponents a, γ, δ as a = γ/δ.
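The exponent relations stated above can be evaluated directly. The sketch below (not from the authors' code) computes the predicted k = (D-1)/(D+1) for a few depths D, and the cross-over exponent a = γ/δ; the numeric values of γ and δ used here are illustrative placeholders, not measured exponents.

```python
# Illustrative sketch of the scaling-argument predictions in the abstract.
# D, gamma and delta below are example inputs, not values from the paper.

def k_exponent(D: int) -> float:
    """Predicted exponent k in T_opt ~ alpha^k for a net with D hidden layers."""
    return (D - 1) / (D + 1)

def a_exponent(gamma: float, delta: float) -> float:
    """Predicted cross-over exponent a in T_c ~ P^(-a), from a = gamma / delta."""
    return gamma / delta

if __name__ == "__main__":
    for D in (1, 2, 3, 5):
        print(f"D = {D} hidden layers: k = {k_exponent(D):.3f}")
    # With placeholder values gamma = 1.0, delta = 0.5:
    print(f"a = {a_exponent(1.0, 0.5):.1f}")
```

Note that k vanishes for D = 1 and approaches 1 as depth grows, so the predicted optimal temperature depends on α more strongly in deeper networks.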

Presenters

  • Antonio Sclocchi

    EPFL, Ecole Polytechnique Federale de Lausanne

Authors

  • Antonio Sclocchi

    EPFL, Ecole Polytechnique Federale de Lausanne

  • Mario Geiger

    Massachusetts Institute of Technology

  • Matthieu Wyart

    Ecole Polytechnique Federale de Lausanne