Jacobians in Deep Neural Networks: Criticality and Beyond
ORAL
Abstract
Good parameter initialization is crucial for training deep neural networks. A correct initialization ensures that the network function and its gradients remain well-behaved with depth. The conditions for such an initialization, known as “criticality”, guide the choice of the network's hyperparameters.
Jacobians between layer outputs of the network are central to this analysis. The norm of the Jacobian identifies critical initializations, while the spectrum of the Jacobian matrix carries information about fluctuations in the gradients; these fluctuations play an important role in very deep networks.
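As an illustration of the kind of quantity involved, here is a minimal NumPy sketch (not code from the paper; the function name avg_jacobian_norm and all parameter values are assumptions) that Monte-Carlo-estimates the averaged Jacobian norm (1/N) Tr(J Jᵀ) for a ReLU MLP at initialization. Near the ReLU critical point (σ_w² = 2, σ_b = 0) the norm stays O(1) with depth; away from it, it grows or decays exponentially.

```python
import numpy as np

def avg_jacobian_norm(depth, width, sigma_w, sigma_b, n_trials=10, seed=0):
    """Monte-Carlo estimate of (1/width) * Tr(J J^T), where J is the Jacobian
    of the last hidden layer with respect to the input, for a ReLU MLP at
    initialization. Function and argument names are illustrative."""
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(n_trials):
        h = rng.standard_normal(width)      # random input
        J = np.eye(width)                   # Jacobian of the input w.r.t. itself
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
            b = rng.standard_normal(width) * sigma_b
            pre = W @ h + b
            D = np.diag((pre > 0).astype(float))  # derivative of ReLU
            J = D @ W @ J                   # chain rule, layer by layer
            h = np.maximum(pre, 0.0)
        norms.append(np.trace(J @ J.T) / width)
    return float(np.mean(norms))

# sigma_w^2 = 2 is critical for ReLU: the norm stays O(1);
# smaller sigma_w makes it vanish, larger makes it explode with depth.
for sw in (1.0, np.sqrt(2.0), 2.0):
    print(f"sigma_w = {sw:.3f}:",
          avg_jacobian_norm(depth=30, width=256, sigma_w=sw, sigma_b=0.0))
```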
I will begin my talk by formulating criticality in terms of the Jacobian norm. Using this formulation, I will show that it is possible to design networks that are “everywhere-critical”, i.e. critical irrespective of the choice of initialization, by incorporating LayerNorm/BatchNorm and residual connections. I will then discuss modern architectures that employ this combination, followed by experimental results demonstrating the effect of criticality on training. Finally, using the Jacobian spectrum, I will derive additional constraints on the hyperparameters in the everywhere-critical case.
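The sketch below is again purely illustrative: it assumes a pre-LayerNorm ReLU residual block (the exact block structure studied in the paper may differ) and shows the qualitative effect of combining LayerNorm with a skip connection: for an arbitrary, far-from-critical weight scale, the squared norm of the activations grows only linearly with depth instead of exponentially.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm over the feature dimension."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def preln_residual_block(x, W, b):
    """One pre-LayerNorm residual block: x -> x + W relu(LayerNorm(x)) + b.
    Illustrative block structure, not necessarily the one used in the paper."""
    return x + W @ np.maximum(layer_norm(x), 0.0) + b

rng = np.random.default_rng(0)
width, depth, sigma_w = 128, 100, 3.0   # sigma_w deliberately far from the critical value
x = rng.standard_normal(width)
for l in range(1, depth + 1):
    W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
    x = preln_residual_block(x, W, b=np.zeros(width))
    if l % 25 == 0:
        # squared activation norm grows roughly linearly with depth for any sigma_w
        print(f"depth {l:3d}: ||x||^2 / width = {float(x @ x) / width:.1f}")
```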
–
Publication: Doshi, D., He, T. and Gromov, A. "Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications". arXiv:2111.12143v3
Presenters
-
Darshil H Doshi
University of Maryland, College Park
Authors
-
Darshil H Doshi
University of Maryland, College Park
-
Tianyu He
University of Maryland, College Park
-
Andrey Gromov
University of Maryland, College Park