Variational gradient descent: enhancing generalization with automatically learned landscape-dependent noise.
ORAL · Invited
Abstract
One of the core challenges in deep learning is finding generalizable solutions. The loss landscape of an overparameterized neural network contains many global minima with identical loss, but these can vary widely in their ability to generalize to unseen data or tasks. Landscape characteristics, such as local flatness near a solution, have been demonstrated to lead to improved generalization, and stochastic learning algorithms aim to improve the odds of finding such generalizable solutions. Stochastic gradient descent (SGD), for example, achieves this goal by introducing implicit regularization that favors flat solutions via landscape-dependent noise arising from training on batches. We propose a complementary algorithm, called variational gradient descent (VGD), which introduces noise into the learning dynamics by randomly varying the network weights. Due to an exact activity-weight duality in feedforward network layers, this weight noise introduces an effective regularization similar to that of SGD. We also demonstrate that the variance and correlations of the weight fluctuations can be treated as learned quantities. The weight noise is thus automatically tuned to the local structure of the loss landscape, which in turn allows efficient learning of a low-rank approximation to the Hessian of the loss. By combining stochastic-physics modeling of learning dynamics with applications to classification problems, we demonstrate that VGD can accelerate learning trajectories over barriers in the loss landscape, improving both convergence and generalization. The VGD learning rule can be flexibly combined with SGD, dropout, and weight decay, and hence provides a new avenue for improving generalization in deep learning.
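The abstract does not spell out the VGD update rule. The following is a minimal illustrative sketch, assuming diagonal Gaussian weight perturbations whose per-weight noise scales are learned jointly with the weights by differentiating the loss through the perturbed weights; all names (the toy data, `model`, `log_sigma`) are hypothetical and the authors' actual rule, including the correlated noise and low-rank Hessian estimate, may differ.

```python
# Sketch of a VGD-style update (an assumption, not the authors' exact rule):
# perturb weights with learned diagonal Gaussian noise, take the gradient at the
# perturbed point, and adapt the noise scales from the same gradient so the noise
# shrinks along stiff (high-curvature) directions and persists along flat ones.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-class data (hypothetical stand-in for a classification task).
X = torch.randn(256, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

# Learned log standard deviations of the per-weight noise.
log_sigma = [torch.full_like(p, -3.0, requires_grad=True) for p in model.parameters()]

opt_w = torch.optim.SGD(model.parameters(), lr=0.1)
opt_s = torch.optim.SGD(log_sigma, lr=0.01)

for step in range(200):
    opt_w.zero_grad()
    opt_s.zero_grad()

    # Sample a perturbation delta = sigma * xi and apply it in place: w_eff = w + delta.
    deltas = []
    with torch.no_grad():
        for p, ls in zip(model.parameters(), log_sigma):
            delta = ls.exp() * torch.randn_like(p)
            deltas.append(delta)
            p.add_(delta)

    loss = loss_fn(model(X), y)
    loss.backward()  # gradients evaluated at the perturbed weights

    with torch.no_grad():
        for p, ls, delta in zip(model.parameters(), log_sigma, deltas):
            p.sub_(delta)  # restore the unperturbed weights
            # Chain rule through w_eff = w + exp(ls) * xi gives dL/d(ls) = grad * delta.
            ls.grad = p.grad * delta

    opt_w.step()
    opt_s.step()

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```

In this sketch, perturbations that raise the loss (which happens on average along high-curvature directions near a minimum) drive the corresponding noise scales down, so the noise amplitude tracks the local landscape; capturing the cross-weight correlations and the low-rank Hessian approximation described in the abstract would require maintaining a full or factored covariance, which is omitted here.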
Presenters
- David Hathcock (IBM Thomas J. Watson Research Center)

Authors
- David Hathcock (IBM Thomas J. Watson Research Center)
- Yuhai Tu (IBM Thomas J. Watson Research Center)