How neural nets compress invariant manifolds
ORAL
Abstract
The success of neural networks is often attributed to their ability to learn relevant features from data while becoming insensitive to invariants by compressing them.
We study how neural networks compress the uninformative directions of input space in models where the data lie in d dimensions but the labels vary only within a linear manifold of dimension d∥ < d. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the uninformative space is compressed by a factor √p, where p is the size of the training set. For large initialization of the weights (the lazy training regime), no compression occurs. We quantify the benefit of such compression on the test error ε and find that it improves the learning curve ε ∼ p^(-β), i.e. β_Feature > β_Lazy.
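As an illustration (not the authors' code), the sketch below sets up stripe-model-like data and one way to measure the compression of the uninformative directions in a one-hidden-layer network; the data generator make_stripe_data, the single-interface labels y = sign(x_1), and all hyperparameters are assumptions chosen for readability.

```python
# Minimal sketch, assuming a single-interface stripe model y = sign(x_1) and
# illustrative hyperparameters; not the authors' implementation.
import torch

def make_stripe_data(p, d, seed=0):
    """Gaussian inputs in d dimensions; labels depend on the first coordinate only (d∥ = 1)."""
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(p, d, generator=g)
    y = torch.sign(x[:, 0])                 # one interface at x_1 = 0
    return x, y

d, p, h = 10, 1024, 256                     # input dimension, training-set size, hidden width
x, y = make_stripe_data(p, d)

# One-hidden-layer network; infinitesimal initialization puts training in the feature regime.
init_scale = 1e-3
w1 = (init_scale * torch.randn(h, d)).requires_grad_()
w2 = (init_scale * torch.randn(h)).requires_grad_()

def f(x):
    return torch.relu(x @ w1.T) @ w2

opt = torch.optim.SGD([w1, w2], lr=0.1)
for step in range(5000):
    loss = torch.nn.functional.softplus(-y * f(x)).mean()   # smooth margin-based loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compression: first-layer weights align with the informative direction e_1, so the
# ratio ||w_parallel|| / ||w_perp|| grows with p (predicted to scale like sqrt(p)).
w_par = w1[:, 0].norm().item()
w_perp = (w1[:, 1:].norm() / (d - 1) ** 0.5).item()
print(f"first-layer weight anisotropy: {w_par / w_perp:.1f}")
```

In the feature regime (small init_scale) this anisotropy ratio is expected to grow with p, roughly as √p, whereas with a large initialization (lazy regime) it should remain of order one.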
Next, we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) during training, so that its top eigenvectors become more informative and display a larger projection onto the labels. Consequently, kernel learning with the NTK frozen at the end of training outperforms kernel learning with the NTK at initialization.
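The quantity in question can be probed with a generic empirical-NTK computation. The sketch below (a minimal example with an assumed tiny network and stripe-like labels, not the paper's code) builds the NTK Gram matrix and measures how much of the label vector lies in its top eigenvectors; per the abstract, this fraction grows when the NTK is recomputed at the end of feature-regime training.

```python
# Minimal sketch of the measurement, with an assumed tiny network and stripe-like
# labels; not the paper's code.
import torch

torch.manual_seed(0)
d, p, h = 10, 200, 64
x = torch.randn(p, d)
y = torch.sign(x[:, 0])                        # labels depend on x_1 only

w1 = torch.randn(h, d, requires_grad=True)
w2 = torch.randn(h, requires_grad=True)
params = [w1, w2]

def f(xi):
    # NTK parametrization of a one-hidden-layer network
    return torch.relu(xi @ w1.T) @ w2 / h ** 0.5

# Per-sample gradients give the Jacobian J (p x n_params); the empirical NTK Gram
# matrix is Theta = J J^T.
rows = []
for i in range(p):
    gi = torch.autograd.grad(f(x[i]), params)
    rows.append(torch.cat([g.flatten() for g in gi]))
J = torch.stack(rows)
ntk = J @ J.T

# Fraction of the label norm captured by the top-k NTK eigenvectors.
k = 10
_, evecs = torch.linalg.eigh(ntk)              # eigenvalues in ascending order
top = evecs[:, -k:]                            # eigenvectors of the k largest eigenvalues
frac = ((top.T @ y) ** 2).sum() / (y @ y)
print(f"label mass on the top-{k} NTK eigenvectors: {frac.item():.2f}")
```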
We confirm these predictions both for a one-hidden-layer fully connected (FC) network trained on the stripe model, in which decision boundaries are parallel interfaces (d∥ = 1), and for a 16-layer CNN trained on MNIST.
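For completeness, a small utility (again an illustration, not taken from the paper) for extracting the learning-curve exponent β from measured test errors, assuming ε ∼ p^(-β): fit a straight line to log ε versus log p.

```python
# Small utility (an assumption-laden illustration) for estimating beta from a
# measured learning curve, given eps ~ p^(-beta).
import numpy as np

def fit_beta(train_sizes, test_errors):
    """Least-squares slope of log(error) versus log(p); returns the estimated beta."""
    slope, _ = np.polyfit(np.log(train_sizes), np.log(test_errors), deg=1)
    return -slope

# Self-check on synthetic values that follow eps = p^(-0.5) exactly.
p_vals = np.array([128, 256, 512, 1024, 2048], dtype=float)
assert abs(fit_beta(p_vals, p_vals ** -0.5) - 0.5) < 1e-8
```

Comparing the exponents obtained this way for networks trained in the feature and lazy regimes is how the statement β_Feature > β_Lazy can be checked empirically.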
Presenters
-
Leonardo Petrini
Ecole Polytechnique Federale de Lausanne
Authors
-
Jonas Paccolat
Ecole Polytechnique Federale de Lausanne
-
Leonardo Petrini
Ecole Polytechnique Federale de Lausanne
-
Mario Geiger
Ecole Polytechnique Federale de Lausanne
-
Kevin Tyloo
Ecole Polytechnique Federale de Lausanne
-
Matthieu Wyart
Physics of Complex Systems Laboratory, Institute of Physics, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland