Same features, different encodings: three case studies of path dependence in grokking and learning.
ORAL
Abstract
Neural network training is a complicated dynamical process. Whether the outcome of training depends on the learning path has deep implications for how we can understand and use neural networks. Two extremes are grokking, where a network generalizes only after a long period of overtraining, and "steady" learning, where the training and test loss improve together. We investigate three simple tasks in which we can induce both learning paths: classifying phases of the Ising model from snapshots, the modular addition problem in which grokking was first discovered, and the benchmark MNIST task.
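To make the setting concrete, here is a minimal sketch of the modular addition task with strong weight decay, the generic recipe for inducing grokking in the literature. This is not the authors' code: the modulus, architecture, train fraction, and hyperparameters below are illustrative assumptions.

    # Hypothetical sketch: grokking on (a + b) mod p with a small MLP and
    # strong weight decay. All choices here are illustrative assumptions.
    import torch
    import torch.nn as nn

    p = 97                                   # modulus; dataset is all pairs (a, b)
    pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
    labels = (pairs[:, 0] + pairs[:, 1]) % p

    # Small training fraction plus weight decay is the standard grokking recipe.
    perm = torch.randperm(len(pairs))
    n_train = int(0.3 * len(pairs))
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    model = nn.Sequential(
        nn.Embedding(p, 64),                 # shared embedding for both operands
        nn.Flatten(start_dim=1),             # (batch, 2, 64) -> (batch, 128)
        nn.Linear(128, 256), nn.ReLU(),
        nn.Linear(256, p),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(20000):
        opt.zero_grad()
        loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
        loss.backward()
        opt.step()
        if epoch % 1000 == 0:
            with torch.no_grad():
                test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
            print(f"epoch {epoch:6d}  train loss {loss.item():.4f}  test acc {test_acc.item():.3f}")

With a large training fraction (or weak weight decay) the same setup instead tends to produce "steady" learning, where train and test accuracy rise together.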
Using techniques from interpretability and information geometry, we systematically contrast the features, encodings, and trajectories of grokking and "steady" learning. First, we find that the features learned in our example problems are the same along both paths. The features of the network trained on Ising phases are particularly clear: the model learns to calculate the energy of a snapshot. Second, although the features are the same for both grokking and steady learning, the efficiency of their encodings can differ dramatically, by up to an order of magnitude. Finally, we show that the accuracy plateau in grokking is typically accompanied by exponential decay of the weights as a function of the number of epochs, and that the grokking time appears to exhibit power-law scaling across more than four decades of weight decay.
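The last two claims describe simple functional forms: an exponentially decaying weight norm during the plateau, and a grokking time scaling as a power of the weight-decay strength. The sketch below shows one way to extract both trends from logged training curves by linear fits in log (or log-log) space; the function names and inputs are assumptions for illustration, not the authors' analysis code.

    # Hypothetical diagnostics for the two trends described above.
    import numpy as np

    def fit_exponential_decay(epochs, weight_norms):
        """Fit ||W(t)|| ~ A * exp(-lam * t) over the plateau; returns (A, lam)."""
        slope, intercept = np.polyfit(epochs, np.log(weight_norms), 1)
        return np.exp(intercept), -slope

    def fit_grokking_power_law(weight_decays, grokking_times):
        """Fit t_grok ~ C * gamma**(-alpha) across decades of weight decay gamma."""
        slope, intercept = np.polyfit(np.log(weight_decays), np.log(grokking_times), 1)
        return np.exp(intercept), -slope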
Presenters
-
Dmitry Manning-Coe
University of Illinois at Urbana-Champaign
Authors
-
Dmitry Manning-Coe
University of Illinois at Urbana-Champaign
-
Jacopo Gliozzi
University of Illinois at Urbana-Champaign
-
Alexander G Stapleton
Queen Mary University of London
-
Edward Hirst
Queen Mary University of London
-
Marc Klinger
University of Illinois at Urbana-Champaign
-
Giuseppe de Tomasi
University of Illinois at Urbana-Champaign
-
David S Berman
Queen Mary University of London