High variance weight updates and sampling dynamics strongly affect how diffusion generative models generalize: a path-integral view
ORAL
Abstract
In general, complex interactions between task structure, network architecture, and learning rules determine what a system actually learns and how it generalizes beyond its training data. But in the context of generative models, at least two additional features substantially affect generalization, and have received little theoretical attention thus far: high variance in the learning target, and the details of how the learned model is actually used. We investigate how all of these features affect generalization in diffusion generative models, which learn to convert pure noise into samples similar to those from some training distribution (e.g., of images), and whose generalization behavior is poorly understood. Diffusion models are usually trained via a denoising score matching objective whose target equals the score function (i.e., the gradient of the log-likelihood) of the training distribution only in expectation, often with extremely high variance in 'boundary/gap' regions between training examples. Although one might expect this high variance to cause problems, it instead produces a helpful gap-filling inductive bias by 'smearing' the training distribution, especially in boundary regions, because the learned probability distribution depends nonlinearly on the learned score function through a certain stochastic process. In this work, we develop a mathematical theory that partly explains this 'generalization through variance' phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that, in the typical case, sampling from diffusion models involves integrating an effective stochastic differential equation (SDE) whose noise term controls the details of generalization. Moreover, for this noise term to be nontrivial, the number of model parameters should be comparable to the number of samples used during training, and the noise is largest when model features take 'atypical' values (i.e., values somewhat larger than their overall standard deviation). Finally, we provide a semiclassical analysis of the effective SDE that describes key features of how diffusion models generalize.
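To make the two ingredients above concrete, the following is a minimal sketch, not taken from the paper, assuming a toy 1-D training set and a variance-exploding noise schedule; all names, parameters, and schedule choices here are illustrative. It shows (i) that the per-example denoising score matching target matches the exact score of the noised empirical distribution only on average, with large variance near gaps between training points, and (ii) Euler–Maruyama integration of the reverse-time SDE used for sampling.

```python
# Minimal sketch (not from the paper): denoising score matching targets and
# reverse-SDE sampling for a toy 1-D dataset under a variance-exploding
# forward process x_t = x_0 + sigma(t) * eps.  All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.array([-2.0, -1.5, 1.0, 2.5])      # toy training examples
sigma_min, sigma_max = 0.01, 3.0

def sigma(t):
    """Noise scale at time t in [0, 1] (geometric schedule)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def exact_score(x, t):
    """Score of the noised *empirical* distribution: a Gaussian mixture
    centered on the training points with std sigma(t)."""
    s = sigma(t)
    diffs = x_train[None, :] - np.atleast_1d(x)[:, None]   # (n_query, n_train)
    logw = -0.5 * (diffs / s) ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w * diffs).sum(axis=1) / s**2                   # posterior mean of (x0 - x) / s^2

def dsm_target(eps, t):
    """Per-example regression target: equals the exact score only in expectation.
    Its variance is largest in 'gap' regions where several x0 are plausible."""
    return -eps / sigma(t)

# Variance of the DSM target at a noised point in the gap between training examples.
t = 0.7
x0 = rng.choice(x_train, size=200_000)
eps = rng.standard_normal(200_000)
x_t = x0 + sigma(t) * eps
mask = np.abs(x_t - 0.0) < 0.1            # noised points near the gap at x ~ 0
print("empirical mean of targets:", dsm_target(eps[mask], t).mean())
print("exact score at x = 0     :", exact_score(0.0, t)[0])
print("target std (variance!)   :", dsm_target(eps[mask], t).std())

def sample(n_samples=5, n_steps=500, score_fn=exact_score):
    """Euler–Maruyama integration of the reverse-time VE SDE
    dx = -g(t)^2 * score(x, t) dt + g(t) dw, run from t = 1 down to t ~ 0."""
    x = sigma_max * rng.standard_normal(n_samples)          # start from wide Gaussian
    ts = np.linspace(1.0, 1e-3, n_steps)
    dt = ts[0] - ts[1]
    for t in ts:
        g2 = 2 * sigma(t) ** 2 * np.log(sigma_max / sigma_min)   # g(t)^2 for this schedule
        x = x + g2 * score_fn(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(n_samples)
    return x

print("samples:", np.round(sample(), 2))
```

Near the gap at x ≈ 0, the per-example targets agree with the exact score only on average and have a large standard deviation, which illustrates the high-variance regime described in the abstract; swapping `exact_score` for a learned approximation in `sample` gives the usual generative sampling loop.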
Publication: Generalization through variance: how noise shapes inductive biases in diffusion models (ICLR 2025 submission)
Presenters
- John Joseph Vastola (Harvard Medical School)
Authors
- John Joseph Vastola (Harvard Medical School)