A simple model for Grokking modular arithmetic
ORAL
Abstract
Grokking is a sudden onset of generalization following a long period of overfitting. This effect was first discovered empirically on datasets generated by a discrete rule, such as the multiplication tables of finite groups.
In this talk I will present a simple neural network that groks a variety of modular arithmetic tasks. The network consists of a single hidden layer with a quadratic activation function (which can be replaced with more common activation functions if desired). I will show that (i) the model exhibits grokking on modular arithmetic tasks when trained with vanilla gradient descent on an MSE loss, in the absence of any regularization; (ii) grokking corresponds to learning very specific features whose structure is determined by the modular arithmetic task at hand; and (iii) there is an analytic expression for the weights that solve the modular addition problem and are found by gradient descent, thereby establishing complete interpretability of the algorithm learnt by the network.
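As a rough illustration of the setup described above, the following is a minimal sketch (not the author's code) of a one-hidden-layer network with a quadratic activation, trained on modular addition with plain gradient descent and an MSE loss and no regularization; the modulus, width, learning rate, and step count are illustrative assumptions, not values from the talk.

```python
import numpy as np

p = 23          # modulus (assumed, for illustration)
N = 256         # hidden width (assumed)
lr = 1e-2       # learning rate (assumed)
steps = 5000    # training steps (assumed)

rng = np.random.default_rng(0)

# Full dataset: all p^2 pairs (a, b); one-hot encode the pair as a 2p-dim input,
# and the label (a + b) mod p as a p-dim one-hot target.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
X = np.zeros((p * p, 2 * p))
X[np.arange(p * p), pairs[:, 0]] = 1.0
X[np.arange(p * p), p + pairs[:, 1]] = 1.0
Y = np.zeros((p * p, p))
Y[np.arange(p * p), (pairs[:, 0] + pairs[:, 1]) % p] = 1.0

# Random train/test split, so test accuracy can lag train accuracy (the grokking regime).
perm = rng.permutation(p * p)
train, test = perm[: p * p // 2], perm[p * p // 2:]

W1 = rng.normal(0, 1 / np.sqrt(2 * p), (2 * p, N))
W2 = rng.normal(0, 1 / np.sqrt(N), (N, p))

for step in range(steps):
    # Forward pass: single hidden layer with quadratic activation phi(z) = z^2.
    Z = X[train] @ W1
    H = Z ** 2
    out = H @ W2
    err = out - Y[train]                      # gradient of 0.5 * MSE
    # Backward pass: vanilla gradient descent, no weight decay or other regularization.
    gW2 = H.T @ err / len(train)
    gH = err @ W2.T
    gW1 = X[train].T @ (gH * 2 * Z) / len(train)
    W1 -= lr * gW1
    W2 -= lr * gW2

# Test accuracy: argmax over the p outputs.
pred = ((X[test] @ W1) ** 2 @ W2).argmax(1)
print("test accuracy:", (pred == Y[test].argmax(1)).mean())
```

Whether and when this toy run groks depends on the (assumed) hyperparameters and the train fraction; it is meant only to make the architecture and training protocol concrete.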
Presenters
- Andrey Gromov, University of Maryland, College Park
Authors
- Andrey Gromov, University of Maryland, College Park