Exploring the Relationship Between SGD Noise, Hessian Structure, and Neuron Functionality in Artificial Neural Networks
ORAL
Abstract
Artificial neural networks (ANNs) exhibit remarkable generalizability. During training, some neurons undergo significant transformations and acquire distinct functional roles. Training is typically driven by the prevalent optimizer, stochastic gradient descent (SGD). Empirical studies suggest that the noise structure inherent to SGD strongly correlates with the local Hessian of the loss landscape, a relationship that plays a critical role in finding solutions that generalize well. Beyond guiding the optimization process, this relationship also interacts with other intrinsic properties of the network: the number of relatively sharp eigendirections of the Hessian, as well as the number of activated neurons, increases with the complexity of the data. Moreover, the permutation symmetry of neurons within each hidden layer allows for multiple equivalent configurations of the network parameters. As training progresses, this symmetry permits further refinement in how neurons align their functional roles. Consequently, the maximum cosine similarity after permutation between the weight vectors of any pair of parallelly trained ANNs is enhanced, indicating improved structural alignment after training. In this talk, we will discuss 1) how the noise covariance relates to the Hessian through its dependence on the Hessians of the individual sample losses, and then 2) the mechanism behind the enhancement of the maximum cosine similarity and its relation to the architecture of the network and the complexity of the data.
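For orientation only (the talk's own derivation may differ), a textbook expression for the minibatch SGD gradient-noise covariance, written with per-sample losses \(\ell_i\), full loss \(L = \frac{1}{N}\sum_i \ell_i\), and batch size \(B\), is
\[
\Sigma(\theta) \;\approx\; \frac{1}{B}\left[\frac{1}{N}\sum_{i=1}^{N} \nabla\ell_i(\theta)\,\nabla\ell_i(\theta)^{\top} \;-\; \nabla L(\theta)\,\nabla L(\theta)^{\top}\right],
\]
where, near a minimum and for common losses, the first (Fisher-like) term is close to the Hessian of \(L\); this is one standard route by which the noise covariance and the local Hessian become aligned.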
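To make point 2 concrete, here is a minimal sketch of how the maximum cosine similarity after neuron permutation can be computed for one hidden layer, using the Hungarian algorithm. The weight matrices W_a and W_b and their shapes are hypothetical placeholders for illustration, not the authors' actual code or setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cos_similarity_after_permutation(W_a, W_b):
    """W_a, W_b: (n_hidden, n_in) incoming weight matrices of the same hidden
    layer from two parallelly trained networks. Returns the per-neuron cosine
    similarities under the best neuron permutation, and that permutation."""
    # Normalize each neuron's incoming weight vector.
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sim = A @ B.T  # pairwise cosine similarities between neurons
    # Hungarian algorithm: permutation of network B's neurons that maximizes
    # the total cosine similarity with network A's neurons.
    row, col = linear_sum_assignment(sim, maximize=True)
    return sim[row, col], col

# Example with random weights (real use would load trained parameters):
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 20)), rng.normal(size=(8, 20))
sims, perm = max_cos_similarity_after_permutation(W1, W2)
print(sims.mean())
```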
Presenters
-
Yikuan Zhang
Peking University
Authors
-
Yikuan Zhang
Peking University
-
Ning Yang
Peking University
-
Qi Ouyang
Peking University
-
Yuhai Tu
IBM Thomas J. Watson Research Center