Introduction
This is week 4 of Quintin's Alignment Papers Roundup. The current focus is the inductive biases of stochastic gradient descent.
For most datasets and labels, there are many possible models that reach good performance. "Inductive biases" refers to the various factors that incline a particular training process to find some types of models over others. When the data under-specify the learned model, a training process's inductive biases determine what sort of decision making process the model implements, and how the model generalizes beyond its training data.
I'd intended to publish this last week, but it turns out there's a lot of very technical work on SGD's inductive biases, and I kept finding new papers that seemed relevant. That's why this roundup has 16 papers instead of the usual ~9.
Papers
Eigenspace Restructuring: a Principle of Space and Frequency in Neural Networks
My opinion:
The NTK lets us directly compute the inductive biases of a neural network near a particular point in parameter space. The NTK's eigenfunctions give us possible behaviors, and its spectrum tells us how easy it is for the network to learn each eigenfunction.
This paper uses the NTK to compare the architectural inductive biases of convolutional networks to those of multilayer perceptrons. It seems like a promising approach for better understanding what sorts of behaviors different architectures are inclined to learn. However, this paper makes two major simplifying assumptions: (1) the networks are infinitely wide, so the NTK stays constant over training, and (2) the input data are uniformly distributed.
This paper's discussion of inductive biases focuses a lot on the frequency biases of neural networks, rather than things more related to alignment. It is very hard to use the NTK (or any mathematical formalism) to talk about inductive biases towards or away from "intentional" / high-level concepts, such as values, deception, corrigibility, etc. However, it is much easier to evaluate how different NTKs bias the network towards learning functions of different frequencies, and so much discussion of NTK inductive biases focuses on frequency.
For a freshly initialized network, the NTK gives you a list of candidate functions (e.g., sinusoids of increasing frequency over the input domain) and tells you how easily the network can learn each of them. It's easy to rank these functions by frequency, but not so easy to rank them by how much learning them inclines the network towards liking geese.
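For concreteness, here's a minimal sketch (my own toy example, not the paper's setup) of estimating an empirical NTK for a tiny network and reading off its spectrum; the eigenvectors, sampled at the input points, play the role of the candidate functions above, and the eigenvalues say how quickly each can be learned:

```python
# Minimal sketch (my illustration, not the paper's setup): estimate the empirical NTK of a
# tiny MLP on scalar inputs and look at its eigendecomposition.
import jax
import jax.numpy as jnp

def init_params(key, width=64):
    k1, k2 = jax.random.split(key)
    return {"w1": jax.random.normal(k1, (width, 1)),
            "w2": jax.random.normal(k2, (1, width)) / jnp.sqrt(width)}

def f(params, x):
    # Scalar-input, scalar-output network.
    h = jnp.tanh(params["w1"] @ jnp.array([x]))
    return (params["w2"] @ h)[0]

def empirical_ntk(params, xs):
    # K[i, j] = <df(x_i)/dtheta, df(x_j)/dtheta>, the kernel governing training dynamics
    # in the linearized (constant-NTK) regime.
    grads = jax.vmap(lambda x: jax.grad(f)(params, x))(xs)
    flat = jnp.concatenate(
        [g.reshape(len(xs), -1) for g in jax.tree_util.tree_leaves(grads)], axis=1)
    return flat @ flat.T

params = init_params(jax.random.PRNGKey(0))
xs = jnp.linspace(-1.0, 1.0, 64)
K = empirical_ntk(params, xs)
eigvals, eigvecs = jnp.linalg.eigh(K)
# eigvecs[:, -1], eigvecs[:, -2], ... are the most easily learned "functions" (sampled at xs);
# for a fresh network these are typically low frequency, with eigenvalues decaying for
# higher-frequency eigenvectors.
```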
Implicit Regularization via Neural Feature Alignment
My opinion:
(see below)
What can linearized neural networks actually say about generalization?
My opinion:
A natural question is how to extend approaches like Eigenspace Restructuring to track an architecture's inductive biases across an entire neural network optimization trajectory. Ideally, we'd have a theoretical model of how the NTK and its inductive biases change over the course of training, then "integrate" over that trajectory to fully account for the functions learned during the training process.
The two papers above do not do this. Instead, they empirically investigate how the NTK changes over the course of network training, and how those changes impact our ability to predict training dynamics and generalization, finding that the NTK adapts over time to align with the labeling of the training data.
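One simple way to quantify that kind of kernel-label alignment (the papers may use somewhat different metrics) is kernel-target alignment, tracked by re-estimating the empirical NTK at checkpoints along training:

```python
import jax.numpy as jnp

# Hedged sketch: kernel-target alignment A(K, y) = (y^T K y) / (||K||_F * ||y||^2), one
# standard measure of how well a kernel's dominant eigendirections line up with the labels y.
# The papers above may use somewhat different (e.g., centered) variants.
def kernel_target_alignment(K, y):
    return (y @ K @ y) / (jnp.linalg.norm(K) * (y @ y))

# Re-estimate K (e.g., with the empirical_ntk sketch above) at several checkpoints during
# training and track how its alignment with the training labels changes over time.
```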
While not as useful as a theoretical model of how the NTK changes over training, these empirical results still seem alignment relevant. E.g., they imply that inductive biases can be learned from the training labels, which matches findings that humans become more shape-biased as they grow up, and switch between shape and texture bias depending on what they're looking at (e.g., being shape-biased for animals, but texture-biased for liquids / pastes). See section 1.1 of this dissertation.
Having context-sensitive inductive biases seems very useful if you want to quickly adapt to new information. Using different inductive biases for different (learned) object classes seems ~impossible to encode in an architecture or learning process, so I think it would have to be learned from the training data. Probably, many of the inductive biases of humans and AIs come from complex interactions between architecture, training process and data, and are far outside of the constant NTK limit.
We've also seen a similar result from the other direction: meta learning uses a two-level optimization setup, where the outer optimizer uses second-order gradients to learn an initialization that the inner optimizer can quickly adapt to downstream tasks. However, this paper found that the outer optimizer mostly just learns high-performance features directly.
Also, if SGD learns high-performance inductive biases for its training data, that could explain why explicit meta learning / self modifying training processes don't seem to outperform simple SGD: gradient descent already self modifies to become better at learning the task at hand.
Tuning Frequency Bias in Neural Network Training with Nonuniform Data
My opinion:
The previous two papers investigate how the NTK evolves while training finite width networks (when Eigenspace Restructuring's assumption 1 is violated). This paper develops methods to apply NTK analysis to situations where data are not uniformly distributed (when Eigenspace Restructuring's assumption 2 is violated). They find they can control the degree of frequency bias through the loss function, which further underscores how tightly intertwined a model's inductive biases are with its training data.
On the Activation Function Dependence of the Spectral Bias of Neural Networks
My opinion:
So, it turns out you can just remove the frequency bias of deep networks, and they will still work well for some tasks (or even perform better). I was surprised by this. My impression had been that the generalization capacity of neural networks would be more sensitive to their inductive biases than that.
I really wish the authors had tested their non-frequency-biased networks on a more realistic problem, ideally language modeling at the scale of BERT or larger (it's not that expensive!). It'd be interesting to see whether there are systematic differences in the generalization behavior of language models trained with frequency bias versus language models trained without it.
I also wonder how closely the post-training inductive biases of the two types of models would line up. Can enough data "wash out" the differences in architectural inductive biases?
Spectral Bias in Practice: The Role of Function Frequency in Generalization
My opinion:
This paper describes practical methods for evaluating the frequency sensitivity of neural networks, and how this sensitivity varies across the network's input space. It seems useful for investigating how interventions on network inductive biases impact post-training behaviors (at least, for behaviors related to frequency).
I'd be interested to see how trained networks without an initial frequency bias (see previous paper) compare to those with an initial frequency bias.
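One rough way to make that comparison concrete (my own probe, not the paper's exact measurement procedure) is to look at how much of a trained model's output function lives at each frequency, which is straightforward for low-dimensional inputs:

```python
import jax.numpy as jnp

# Rough illustration (not the paper's procedure): compute the frequency content of a trained
# model's output function on a dense grid of inputs.
def frequency_spectrum(predict_fn, n=1024):
    # predict_fn: hypothetical trained model mapping a batch of scalar inputs to scalar outputs.
    xs = jnp.linspace(0.0, 1.0, n, endpoint=False)
    ys = predict_fn(xs)
    coeffs = jnp.fft.rfft(ys - ys.mean())
    return jnp.abs(coeffs) ** 2  # energy at frequencies 0, 1, 2, ... cycles per unit interval

# Comparing these spectra (and how quickly high-frequency energy appears during training)
# between models trained with and without an architectural frequency bias is one way to make
# the comparison above concrete.
```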
Limitations of the NTK for Understanding Generalization in Deep Learning
My opinion:
This paper illustrates an important limitation of using empirical estimates of the NTK at specific points in training to track inductive biases: there are important learning dynamics that only appear in aggregate across many SGD steps, apparently including scaling laws. Presumably, a proper theory of how the NTK evolves over time would let us predict the actual scaling behavior of architectures.
Concluding thoughts about NTK-based accounts of inductive biases:
I think the real bottleneck on using the NTK in alignment is the difficulty of expressing alignment-relevant behaviors (deception, powerseeking, etc) in terms of the inductive biases described by the NTK.
If we can translate from NTK inductive biases to alignment-relevant behaviors, I think we'd be able to use empirically estimated NTKs at points across the network's training trajectory to get useful estimates of how inclined the network is towards learning those behaviors (rather than needing a theoretical understanding of NTK evolution).
In particular, What can linearized neural networks actually say about generalization? indicates that the NTK can rank the relative learnability of different tasks, even while providing a poor overall estimate of the network's capabilities. So even noisy estimates from the NTK may suffice to determine whether models end up deceptive or powerseeking.
For translating from NTK inductive biases to alignment-relevant behaviors, I think our best bet is to study the NTKs of pretrained LMs. Probably, their NTKs have been restructured to make semantically meaningful behaviors more learnable. I expect it's easier to relate such inductive biases to the behaviors we're interested in.
E.g., the Implicit Regularization paper trained on a toy problem of determining whether points lie in a disk of radius √2π centered at the origin. Figure 1 of that paper shows that the resulting inductive biases align with the task after training.
Of course, empirically estimating the inductive biases of a language model's NTK after training is going to be very difficult. I've not found any papers which attempt such a feat.
At this point in the roundup, we're moving on from architecture-entangled inductive biases / the NTK, and looking into the inductive biases of SGD itself.
Shift-Curvature, SGD, and Generalization
My opinion:
This paper offers a fairly intuitive explanation for why flatter minima generalize better: suppose the training and testing data have distinct, but nearby, minima that minimize their respective loss. Then, the curvature around the training minima acts as the second order term in a Taylor expansion that approximates the expected test loss for models nearby the training minima.
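In symbols (a rough sketch of the idea, not necessarily the paper's exact formulation): let $\theta_{\text{tr}}$ minimize the training loss and $\theta_{\text{te}} = \theta_{\text{tr}} + \delta$ minimize the test loss, with $\delta$ small. Expanding the test loss around its own minimum (where its gradient vanishes),

$$L_{\text{test}}(\theta_{\text{tr}}) \approx L_{\text{test}}(\theta_{\text{te}}) + \tfrac{1}{2}\,\delta^\top H\,\delta,$$

where $H$ is the Hessian near the minima, approximated by the curvature measured around the training minimum. For a given shift $\delta$, flatter curvature (smaller eigenvalues of $H$) means a smaller gap between training and test loss.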
The paper then investigates the impact of gradient noise from SGD and finds that it biases models towards flatter regions of parameter space, even at the cost of worse training loss.
Implicit Gradient Regularization
My opinion:
Even with full-batch gradient descent (so no gradient noise from SGD), it turns out that gradient descent's discrete steps introduce an inductive bias into network training, and that we can analyze this bias with surprisingly straightforward methods. Like the noise from SGD, this inductive bias also pushes the model towards flatter regions of parameter space.
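If I recall the result correctly, it's a backward-error-analysis statement: gradient descent with step size $h$ approximately follows the gradient flow of a modified loss

$$\tilde{L}(\theta) = L(\theta) + \frac{h}{4}\,\|\nabla L(\theta)\|^2,$$

so the extra term penalizes steep slopes, pushing trajectories towards flatter regions, with the strength of the effect scaling with the learning rate.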
Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion
My opinion:
(see below)
Multiplicative noise and heavy tails in stochastic optimization
My opinion:
These papers dive more deeply into the structure and effects of gradient noise, modeling SGD as a diffusion process while making different assumptions about the structure of the gradient noise. They point to a picture where the ratio of learning rate to batch size controls a sort of exploration bias of SGD towards broader basins.
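A standard modeling step in this literature (not something unique to these papers) shows where the learning-rate-to-batch-size ratio comes from: treat the minibatch gradient as the full gradient plus noise with covariance $\Sigma(\theta)/B$, so that SGD with learning rate $\eta$ looks like a discretization of

$$d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\tfrac{\eta}{B}}\,\Sigma(\theta_t)^{1/2}\,dW_t,$$

where the effective "temperature" of the diffusion is set by $\eta / B$. Higher temperature means more exploration and a stronger push out of sharp, narrow basins.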
One interesting thing to note about these inductive biases from the optimizer is that the human brain probably has very similar ones. E.g., the inductive bias found in Implicit Gradient Regularization arises because SGD does not take optimally sized steps for reducing training loss at each update. It seems very unlikely that biological neurons make optimally sized updates to minimize predictive error, which probably leads the brain to also steer towards flatter regions of its parameter space.
Similarly, the brain's optimization process seems pretty noisy (and has batch size one), so the brain probably also mirrors the inductive biases that come from the noise in SGD updates.
Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
My opinion:
This is a very cool paper. It offers a (pretty plausible, IMO) account of how and why the two optimizers have different inductive biases. I think it's a very good sign that we know enough about gradient descent that we can perform these sorts of analyses.
The Low-Rank Simplicity Bias in Deep Networks
My opinion:
The core intuition of this paper is that, when you multiply a bunch of matrices together, the rank of the composite operator is no higher than that of the lowest-rank factor. Models that operate by multiplying matrices together are thus biased towards implementing low-rank functions. Adding residual connections counteracts this bias, and means information flow is no longer bottlenecked by the lowest-rank matrix in the model.
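Here's a tiny numerical illustration of that intuition (my own demo, not an experiment from the paper): the effective rank of a product of random matrices collapses as you add factors, while identity/residual connections keep it high:

```python
# My own demo (not from the paper): products of random matrices have rapidly decaying
# effective rank, while residual connections preserve it.
import jax
import jax.numpy as jnp

d, depth = 64, 8
key = jax.random.PRNGKey(0)
mats = [jax.random.normal(k, (d, d)) / jnp.sqrt(d) for k in jax.random.split(key, depth)]

prod = mats[0]
for W in mats[1:]:
    prod = W @ prod                            # plain deep linear network

prod_res = jnp.eye(d)
for W in mats:
    prod_res = (jnp.eye(d) + W) @ prod_res     # same factors, with residual connections

def effective_rank(M):
    # Entropy-based effective rank of the singular value distribution.
    s = jnp.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return jnp.exp(-(p * jnp.log(p + 1e-12)).sum())

# The plain product typically has a much lower effective rank than the residual product.
print(effective_rank(prod), effective_rank(prod_res))
```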
On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
My opinion:
This paper reflects my intuition that gradient descent is biased towards short paths, possibly because longer paths lose rank too quickly?
Why neural networks find simple solutions: the many regularizers of geometric complexity
My opinion:
I think that something like geometric simplicity bias is at the core of how neural networks learn general solutions. Neural networks mostly seem to model the union of low dimensional manifolds on which their input data lie, then sort of extrapolate the geometry of those manifolds to unseen data. Sudden deviations from the manifold geometry would lead to higher geometric complexity. Learning processes biased towards low geometric complexity tend not to have such deviations.
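If I understand the paper's definition correctly, the quantity being regularized is roughly a discrete Dirichlet energy of the model over its training data. Here is a minimal sketch for a scalar-input, scalar-output model, using a hypothetical predict_fn:

```python
import jax
import jax.numpy as jnp

# Sketch of the kind of quantity involved (roughly a discrete Dirichlet energy over the data;
# see the paper for the exact definition): the average squared input-gradient norm of the model.
def geometric_complexity(predict_fn, xs):
    # predict_fn: hypothetical scalar-output model of a scalar input.
    grads = jax.vmap(jax.grad(predict_fn))(xs)
    return jnp.mean(grads ** 2)

# Functions that swing sharply between nearby data points have large input gradients and hence
# high geometric complexity; learning processes that implicitly penalize this quantity prefer
# smooth interpolations along the data manifold.
```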
The Pitfalls of Simplicity Bias in Neural Networks
My opinion:
This paper shows just how strong neural network simplicity biases are, and also gives some intuition for how the simplicity bias of neural networks is different from something like a circuit simplicity bias or Kolmogorov simplicity bias. E.g., neural networks don't seem all that opposed to memorization. The paper shows examples of neural networks learning a simple linear feature which imperfectly classifies the data, then memorizing the remaining noise, despite there being a slightly more complex feature which perfectly classifies the training data (and I've checked, there's no grokking phase transition, even after 2.5 million optimization steps with weight decay).
It also shows how, depending on the data you're trying to model, a simplicity bias may actually harm generalization.
Conclusion
My biggest takeaway from this review is that SGD has a lot of inductive biases. Even something as simple as the fact that SGD takes discrete, non-optimal update steps leads to systematic bias in the sorts of solutions found. Probably, there are lots of other inductive biases coming from interactions between architecture, data and optimizer.
Also, inductive bias research is making a lot of progress. In particular, the NTK perspective on inductive bias seems to be quickly moving in a potentially valuable direction. If we can reach an okayish understanding of how the NTK evolves over training, and how the inductive biases supplied by the NTK relate to high-level cognitive properties, that might give us something like a non-closed-form account of path-dependent inductive biases.
I've also updated towards humans and AIs having similar inductive biases. There are some inductive biases that I think we straight up share with AIs, such as those that come from making non-optimal / noisy parameter updates. I also think that humans have a fair bit of geometric simplicity bias, as indicated by the fact that most small perturbations to our visual / auditory inputs do not have very large impacts on how we process those inputs.
I hope readers find these papers useful for their own research. Please feel free to discuss the listed papers in the comments or recommend additional papers to me.
Honorable mentions
These are interesting papers that are related to inductive biases, but which I decided not to include in the roundup, both because I didn't want to make the post too long, and because I've delayed the post long enough already.
Future
For next week's roundup, I'm thinking the focus will be on techniques for chain of thought language models.
My other candidate focuses are:
Let me know if there are any topics you're particularly interested in.