Same features, different encodings: three case studies of path dependence in grokking and learning.
ORAL
Abstract
Neural network training is a complicated dynamical process. Whether the outcome of training depends on the learning path has deep implications for how we can understand and use neural networks. Two extremes are grokking, where a network generalizes only after a long period of overtraining, and "steady" learning, where the training and test loss improve together. We investigate three simple tasks in which we can induce both learning paths: classifying phases of the Ising model from snapshots, the modular addition problem in which grokking was first discovered, and the benchmark MNIST task.
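To make the setting concrete, here is a minimal sketch of the modular addition task with strong weight decay, the generic recipe for inducing grokking in the literature. This is not the authors' code: the modulus, architecture, train fraction, and hyperparameters below are illustrative assumptions.

    # Hypothetical sketch: grokking on (a + b) mod p with a small MLP and
    # strong weight decay. All choices here are illustrative assumptions.
    import torch
    import torch.nn as nn

    p = 97                                   # modulus; dataset is all pairs (a, b)
    pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
    labels = (pairs[:, 0] + pairs[:, 1]) % p

    # Small training fraction plus weight decay is the standard grokking recipe.
    perm = torch.randperm(len(pairs))
    n_train = int(0.3 * len(pairs))
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    model = nn.Sequential(
        nn.Embedding(p, 64),                 # shared embedding for both operands
        nn.Flatten(start_dim=1),             # (batch, 2, 64) -> (batch, 128)
        nn.Linear(128, 256), nn.ReLU(),
        nn.Linear(256, p),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(20000):
        opt.zero_grad()
        loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
        loss.backward()
        opt.step()
        if epoch % 1000 == 0:
            with torch.no_grad():
                test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
            print(f"epoch {epoch:6d}  train loss {loss.item():.4f}  test acc {test_acc.item():.3f}")

With a large training fraction (or weak weight decay) the same setup instead tends to produce "steady" learning, where train and test accuracy rise together.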
Using techniques from interpretability and information geometry, we systematically contrast the features, encodings, and trajectories of grokking and "steady" learning. First, we find that the features learned in our example problems are the same along both paths. The features of the network trained on Ising phases are particularly clear: the model learns to calculate the energy of a snapshot. Second, although the features are the same for both grokking and steady learning, the efficiency of their encodings can differ dramatically, by up to an order of magnitude. Finally, we show that the accuracy plateau in grokking is typically accompanied by exponential decay of the weights as a function of the number of epochs, and that the grokking time appears to exhibit power-law scaling across more than four decades of weight decay.
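The last two claims describe simple functional forms: an exponentially decaying weight norm during the plateau, and a grokking time scaling as a power of the weight-decay strength. The sketch below shows one way to extract both trends from logged training curves by linear fits in log (or log-log) space; the function names and inputs are assumptions for illustration, not the authors' analysis code.

    # Hypothetical diagnostics for the two trends described above.
    import numpy as np

    def fit_exponential_decay(epochs, weight_norms):
        """Fit ||W(t)|| ~ A * exp(-lam * t) over the plateau; returns (A, lam)."""
        slope, intercept = np.polyfit(epochs, np.log(weight_norms), 1)
        return np.exp(intercept), -slope

    def fit_grokking_power_law(weight_decays, grokking_times):
        """Fit t_grok ~ C * gamma**(-alpha) across decades of weight decay gamma."""
        slope, intercept = np.polyfit(np.log(weight_decays), np.log(grokking_times), 1)
        return np.exp(intercept), -slope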
Presenters
-
Dmitry Manning-Coe
University of Illinois at Urbana-Champaign
Authors
-
Dmitry Manning-Coe
University of Illinois at Urbana-Champaign
-
Jacopo Gliozzi
University of Illinois at Urbana-Champaign
-
Alexander G Stapleton
Queen Mary University of London
-
Edward Hirst
Queen Mary University of London
-
Marc Klinger
University of Illinois at Urbana-Champaign
-
Giuseppe de Tomasi
University of Illinois at Urbana-Champaign
-
David S Berman
Queen Mary University of London