Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
ORAL
Abstract
Large language models can solve tasks that were not present in the training set -- a capability that is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions z = ax + by mod p labeled by the task vector (a, b) ∈ Z_p^2. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both phases, and discuss the learned algorithm.
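To make the task setup concrete, the sketch below (not the authors' code; the modulus, split fraction, and function names are illustrative assumptions) shows how such a family of linear modular tasks can be enumerated, split into pre-training and out-of-distribution sets, and sampled as in-context example sequences.

```python
# A minimal sketch of the task family described in the abstract:
# each task is a linear modular function z = (a*x + b*y) mod p, labeled by (a, b),
# with some task vectors held out for out-of-distribution testing.
import random

p = 29  # prime modulus; the specific value is an illustrative choice

# Enumerate all task vectors (a, b) in Z_p^2 and split into pre-training / OOD sets.
all_tasks = [(a, b) for a in range(p) for b in range(p)]
random.seed(0)
random.shuffle(all_tasks)
num_pretrain = len(all_tasks) // 2  # the pre-training fraction is a free parameter
pretrain_tasks, ood_tasks = all_tasks[:num_pretrain], all_tasks[num_pretrain:]

def sample_sequence(task, num_examples):
    """Sample (x, y, z) triples for one task, flattened into an in-context sequence."""
    a, b = task
    seq = []
    for _ in range(num_examples):
        x, y = random.randrange(p), random.randrange(p)
        z = (a * x + b * y) % p
        seq.extend([x, y, z])
    return seq

example_sequence = sample_sequence(pretrain_tasks[0], num_examples=4)
```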
–
Publication: Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov; "Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks"; NeurIPS 2024 (Oral)
Presenters
-
Darshil H Doshi
University of Maryland College Park
Authors
-
Darshil H Doshi
University of Maryland College Park
-
Tianyu He
University of Maryland College Park
-
Aritra Das
University of Maryland College Park
-
Andrey Gromov
University of Maryland College Park