How SGD noise affects performance in distinct regimes of deep learning
ORAL
Abstract
Understanding when the noise in stochastic gradient descent (SGD) improves the generalization of neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise, or 'temperature' T, affects performance as the scale of initialization α is varied. α is a key parameter that controls whether the network is 'lazy' and behaves as a kernel (α >> 1), or instead learns features (α << 1).
For classification of MNIST and CIFAR10 images by deep nets, we empirically observe that: (i) if α << 1, the optimal test error is achieved for a temperature value T_opt ~ α^k.
In the kernel regime, (ii) the relative variation of the weights at the end of training, with respect to their initialization, increases as T^δ P^γ, where P is the number of training points; (iii) the training time t*, defined as the learning rate times the number of training steps required to bring a hinge loss to zero, increases as t* ~ T P^b; (iv) at the cross-over temperature T_c ~ P^{-a}, the model escapes the kernel regime and its test error changes. We rationalize (i) with a scaling argument yielding k = (D-1)/(D+1), where D is the number of hidden layers of the network. We explain (ii, iii) using a perceptron architecture, for which we can compute the weight-dependent covariance of the SGD noise and obtain the exponents b, γ and δ; b and γ are found to depend on the density of data near the boundary separating the labels. This model demonstrates that increasing the noise magnitude T increases the training time, leading to a larger change of the weights and allowing the model to escape the kernel regime. We therefore rationalize (iv) with a scaling argument relating the exponents a, γ, δ as a = γ/δ.
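A minimal sketch of the scaling argument behind (iv), written in LaTeX and assuming, as stated above, that the kernel regime is escaped once the relative weight variation with respect to initialization becomes of order one (the symbols ΔW and W_0 below are illustrative shorthand for the weight change and the initial weights, not notation from the paper):

% Hedged sketch: combining the scaling in (ii) with an order-one escape criterion gives the exponent a.
\[
  \frac{\lVert \Delta W \rVert}{\lVert W_0 \rVert} \sim T^{\delta} P^{\gamma} \sim 1
  \quad \Longrightarrow \quad
  T_c \sim P^{-\gamma/\delta}
  \quad \Longrightarrow \quad
  a = \frac{\gamma}{\delta}.
\]

Setting the relative variation from (ii) to order one at T = T_c and comparing with T_c ~ P^{-a} yields the stated relation.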
Presenters
-
Antonio Sclocchi
EPFL, École Polytechnique Fédérale de Lausanne
Authors
-
Antonio Sclocchi
EPFL, École Polytechnique Fédérale de Lausanne
-
Mario Geiger
Massachusetts Institute of Technology
-
Matthieu Wyart
École Polytechnique Fédérale de Lausanne