Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
ORAL
Abstract
Large language models can solve tasks that were not present in the training set -- a capability that is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions z = ax + by mod p labeled by the task vector (a, b) ∈ Z_p^2. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both phases, and discuss the learned algorithm.
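To make the task setup concrete, the sketch below (not the authors' code; the modulus, split fraction, and function names are illustrative assumptions) shows how such a family of linear modular tasks can be enumerated, split into pre-training and out-of-distribution sets, and sampled as in-context example sequences.

```python
# A minimal sketch of the task family described in the abstract:
# each task is a linear modular function z = (a*x + b*y) mod p, labeled by (a, b),
# with some task vectors held out for out-of-distribution testing.
import random

p = 29  # prime modulus; the specific value is an illustrative choice

# Enumerate all task vectors (a, b) in Z_p^2 and split into pre-training / OOD sets.
all_tasks = [(a, b) for a in range(p) for b in range(p)]
random.seed(0)
random.shuffle(all_tasks)
num_pretrain = len(all_tasks) // 2  # the pre-training fraction is a free parameter
pretrain_tasks, ood_tasks = all_tasks[:num_pretrain], all_tasks[num_pretrain:]

def sample_sequence(task, num_examples):
    """Sample (x, y, z) triples for one task, flattened into an in-context sequence."""
    a, b = task
    seq = []
    for _ in range(num_examples):
        x, y = random.randrange(p), random.randrange(p)
        z = (a * x + b * y) % p
        seq.extend([x, y, z])
    return seq

example_sequence = sample_sequence(pretrain_tasks[0], num_examples=4)
```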
–
Publication: Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov; "Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks"; NeurIPS 2024 (Oral)
Presenters
-
Darshil H Doshi
University of Maryland College Park
Authors
-
Darshil H Doshi
University of Maryland College Park
-
Tianyu He
University of Maryland College Park
-
Aritra Das
University of Maryland College Park
-
Andrey Gromov
University of Maryland College Park