Hey, commendations on sharing your update.
Another similar line of work I like is Roberts+Yaida’s “Principles of Deep Learning Theory” - this is a similar-in-spirit approach to MFT, but they perturb around a different limit and get feature-learning as a finite-width effect. I haven’t studied MFT to compare the validity of the two; my guess is MFT is the more relevant description. PDLT at least does a very good job modernizing the NTK approach and connecting to the older literature. I’m a fanboy as it was my gateway drug for learning theory lol.
Yes! I was familiar with PDLT as well, and I do think it's a similar-in-spirit approach to MFT (if not a continuation of the signal-propagation MFT work). Thanks for the pointer.
“that explains why SGD on overparameterized nets generalizes”
Wait, I thought the singular learning theory stuff already did this part? (Just the "why SGD on overparameterized nets generalizes" part, not the "why particular architectural choices work" or "what particular features get learned" parts.) Neural networks being singular means that the parameter–function map is not a one-to-one correspondence, which means that simpler hypotheses (those that need fewer parameters to be specified or can correct "errors" in some parameters) occupy more volume in parameter-space and are easier for SGD to find first, such that training is implicitly doing a form of minimum-description-length program induction (with the learning coefficient being the measure of complexity rather than the parameter count). Is that too "qualitative" to count as an answer (because the architecture and feature prediction parts are the true test of knowledge)?
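(Loosely, the Watanabe result behind this, stated in my own shorthand: the Bayesian free energy at sample size $n$ expands as
$$F_n \approx n L_n(w_0) + \lambda \log n,$$
where $\lambda$ is the learning coefficient (RLCT). For regular models $\lambda = d/2$ with $d$ the parameter count, but for singular models like neural networks $\lambda$ can be much smaller, which is the sense in which it replaces parameter count as the complexity measure.)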
Not quite. SLT is for a specific subcase of Bayesian learning only, not SGD. Maybe more importantly for this point, it also doesn’t really show why neural network priors are good, just that neural network priors strongly favour some solutions over others.
Some SLT-adjacent stuff is pretty strongly suggestive of a proper answer, but I don’t think there’s a proper full proof of what we want in generality written up publicly yet.
some more thoughts quickly:
sorry i'm aware this is very much not clear but making it clear would be a bunch of work and i'm not going to do it atm ↩︎
which probably isn't always. eg it's probably pretty false for the prior scaling that gives NNGP in the wide limit. a good story would be able to "see" this difference between differently scaled gaussian priors ↩︎
btw the correct meaning of simplicity in this setting is not kolmogorov complexity, but instead circuit size ↩︎
A few days ago, I reviewed a paper titled “There Will Be a Scientific Theory of Deep Learning”. In it, I expressed appreciation to the authors for writing the piece, but skepticism about stronger forms of their titular claims.
Since then I’ve spoken with various past collaborators (via text and in person), and read or reread quite a few deep learning theory papers, including the bombshell Zhang et al. 2016 and Nagarajan et al. 2019 papers that I wrote about on LessWrong.
And the thing is, parts of the infinite-width/depth-limit work turned out to be much more interesting than I had thought. Perhaps I have judged deep learning theory (a bit) too harshly.
(Thanks to Dmitry Vaintrob and Kaarel Hänni in particular for conversations on this topic. Much of this was in private, but it was spurred on by a comment from Dmitry that can be found on LessWrong. Also thanks again to the authors of the scientific theory of deep learning paper, which provided a bunch of references to papers that I had forgotten or been previously unaware of.)
A lot of my impression of the infinite-width and depth-limit work comes from the Neural Tangent Kernel / Neural Network Gaussian Process line of work. This line of work starts from Radford Neal’s 1994 paper, where he noted that an infinitely wide single-hidden-layer neural network with random weights is a Gaussian Process. In 2017/2018, Lee et al. extended this to deep networks, showing that a randomly initialized deep neural network, in a certain type of infinite-width limit, is also a Gaussian Process. This was then extended by the Neural Tangent Kernel work, which described the training dynamics of these infinitely wide networks and showed that training is equivalent to kernel gradient descent with a fixed kernel (the eponymous Neural Tangent Kernel). This allowed people to derive convergence properties and nontrivial generalization bounds.
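To give a flavor of the result (my notation, glossing over regularity conditions): write $\Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle$ for the kernel at initialization. In the appropriate infinite-width limit, $\Theta$ stays fixed throughout training, and gradient flow on the squared loss moves the network's outputs on the training inputs as
$$\frac{d f_t}{dt} = -\eta\, \Theta\, (f_t - y), \qquad f_t = y + e^{-\eta \Theta t}(f_0 - y),$$
so convergence is governed entirely by the spectrum of the fixed kernel $\Theta$.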
Unfortunately, while beautiful, this is definitely not how neural networks actually learn. In the NTK limit, the network behaves as if it were doing linear regression in a feature space whose dimension is the number of neural-net parameters. Notably, there is no feature learning, and only the last-layer weights are updated by a noticeable amount. Unsurprisingly, this does not describe the behavior of real neural networks; small (finite-width) neural networks have been shown to outperform their corresponding tangent kernels.
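To make the "linear regression in parameter-gradient feature space" picture concrete, here is a minimal sketch (my own illustration, not code from any of the papers above; all names are made up for the example) of the empirical NTK of a tiny MLP, and of the linearized model that the infinite-width limit effectively trains:

```python
# A toy sketch of the lazy/NTK picture: the empirical NTK is an inner product of
# parameter-gradients (a kernel on a feature space whose dimension is the number of
# parameters), and f_lin is the linear-in-parameters model the limit effectively trains.
import jax
import jax.numpy as jnp

def init_params(key, d_in=4, width=256):
    k1, k2 = jax.random.split(key)
    # "NTK parameterization": weights are O(1) at init; the 1/sqrt(width) factors
    # live in the forward pass.
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (width,))}

def f(params, x):
    h = jax.nn.relu(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>
    g1, g2 = jax.grad(f)(params, x1), jax.grad(f)(params, x2)
    return sum(jnp.vdot(g1[k], g2[k]) for k in g1)

def f_lin(params0, params, x):
    # First-order Taylor expansion of f around params0: linear in the parameters,
    # with fixed "random features" given by the gradient at initialization.
    g = jax.grad(f)(params0, x)
    return f(params0, x) + sum(jnp.vdot(g[k], params[k] - params0[k]) for k in g)

key = jax.random.PRNGKey(0)
params0 = init_params(key)
x1, x2 = jnp.ones(4), jnp.arange(4.0)
print(empirical_ntk(params0, x1, x2))               # a fixed scalar kernel value
print(f(params0, x1), f_lin(params0, params0, x1))  # agree exactly at initialization
```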
An alternative way of taking an infinite width limit is Mean Field Theory (MFT, applied to deep neural networks). As I understand it, the basic idea behind Mean Field Theory in physics is that, instead of calculating the interactions between many objects, you replace the many-body interactions with an average “field” that captures the overall dynamics of the system. (Hence the name.) In neural network land, it turns out that you can take a different infinite-width limit in which the empirical distribution of hidden-unit parameters, viewed as a probability measure on parameter space, evolves under a deterministic flow. This was worked out around 2018 by Mei, Montanari, and Nguyen, Chizat and Bach, Rotskoff and Vanden-Eijnden, and Sirignano and Spiliopoulos.
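Concretely (roughly in the notation of Mei, Montanari, and Nguyen, suppressing constants and time rescalings): for a two-layer network $f(x; \theta) = \frac{1}{N}\sum_{i=1}^N \sigma_*(x; \theta_i)$, the empirical measure $\hat\rho_N = \frac{1}{N}\sum_i \delta_{\theta_i}$ of the hidden units evolves, as $N \to \infty$, according to a deterministic PDE
$$\partial_t \rho_t = \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta; \rho_t) \big), \qquad \Psi(\theta; \rho) = V(\theta) + \int U(\theta, \theta')\, \rho(d\theta'),$$
a Wasserstein gradient flow of the risk, where $V$ and $U$ come from the linear and quadratic parts of the squared loss.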
Notably, in this different infinite width limit, networks actually learn features. NTK uses 1/√N scaling, which makes parameters move only O(1/√N) during training: too small to change the effective kernel. Mean-field uses 1/N scaling, which lets parameters move Θ(1), so the kernel evolves and hidden representations change over the course of training. In MFT, the model is doing something other than glorified linear regression in a fixed random feature space. That being said, for a few years, MFT was entirely a theory of 2-layer neural networks, and it was genuinely unclear how to extend this to deeper networks.
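Written out for a two-layer network (schematically, ignoring biases), the two scalings are
$$f_{\mathrm{NTK}}(x) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} a_i\, \phi(\langle w_i, x\rangle), \qquad f_{\mathrm{MF}}(x) = \frac{1}{N} \sum_{i=1}^{N} a_i\, \phi(\langle w_i, x\rangle).$$
With learning rates scaled appropriately in each case, the $1/\sqrt{N}$ model only needs each parameter to move $O(1/\sqrt{N})$ to change the output by $O(1)$, so in the limit the Jacobian features are frozen; the $1/N$ model needs $\Theta(1)$ parameter movement, so the features (and hence the kernel) evolve.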
Like most of the deep learning community, I was very impressed by the Tensor Programs work of Greg Yang, which was an extension (though not an obvious one) of the 2-layer MFT work. Greg Yang proved a series of theorems that allowed him to create a unifying framework (abc-parameterization) for deep neural networks, of which NNGP/NTK and MFT are special cases. Notably, this allowed him to derive μP (maximal-update parameterization), which allows hyperparameter transfer across width (though later work would extend this to depth as well). This is widely considered to be perhaps the clearest application (some would say the only clear application) of modern deep learning theory.
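In (roughly) the notation of the Tensor Programs papers, an abc-parameterization of a width-$n$ network writes each weight matrix as
$$W^l = n^{-a_l}\, w^l, \qquad w^l_{\alpha\beta} \sim \mathcal{N}\!\big(0,\, n^{-2 b_l}\big) \ \text{at init}, \qquad \text{learning rate } \eta\, n^{-c},$$
and different choices of the exponents $(a_l, b_l, c)$ recover the NTK parameterization, the standard parameterization, and μP (with the two-layer mean-field parameterization as a special case of μP).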
In my memory, I chalked this up to Greg Yang being a genius; I remembered only μP and the toy shallow-network model that Yang created, which allows one to rederive it.
What I missed, and only learned in the past few days, is that Yang didn't invent this machinery from whole cloth.[1] There was a different line of work, done by a team at Google Brain and confusingly also called mean-field theory, which studied how signals travel forward and backward through a network at initialization (though not the training dynamics). Two pioneering examples of this work are Poole et al.'s Exponential expressivity in deep neural networks through transient chaos and Schoenholz et al.'s Deep Information Propagation. Greg Yang's Tensor Programs work descended from this line of work, and Yang collaborated with Schoenholz and others.
Reading the work, it's clear how Yang's work draws inspiration from this signal-propagation branch of MFT.[2] For example, the signal-propagation MFT work contains special cases of Greg Yang's Master Theorem: both exploit the fact that at infinite width the pre-activations are Gaussian, and track their evolution layer by layer via a deterministic recursion on covariances.
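As an illustration of what such a recursion looks like (again my own sketch for an infinitely wide ReLU network, not code from either line of work):

```python
# A toy sketch of the layer-by-layer covariance recursion: for an infinitely wide ReLU
# network, the covariance of pre-activations at two inputs evolves deterministically
# with depth.
import jax.numpy as jnp

def relu_moment(k11, k12, k22):
    # E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]),
    # i.e. (half of) the degree-1 arc-cosine kernel of Cho & Saul.
    theta = jnp.arccos(jnp.clip(k12 / jnp.sqrt(k11 * k22), -1.0, 1.0))
    return jnp.sqrt(k11 * k22) / (2 * jnp.pi) * (jnp.sin(theta) + (jnp.pi - theta) * jnp.cos(theta))

def nngp_kernel(x1, x2, depth=5, sigma_w2=2.0, sigma_b2=0.0):
    d = x1.shape[0]
    k11, k12, k22 = x1 @ x1 / d, x1 @ x2 / d, x2 @ x2 / d  # layer-0 covariances
    for _ in range(depth):
        # K^{l+1}(x, x') = sigma_w^2 * E_{z ~ N(0, K^l)}[relu(z(x)) relu(z(x'))] + sigma_b^2
        k12_new = sigma_w2 * relu_moment(k11, k12, k22) + sigma_b2
        k11, k22 = (sigma_w2 * relu_moment(k11, k11, k11) + sigma_b2,
                    sigma_w2 * relu_moment(k22, k22, k22) + sigma_b2)
        k12 = k12_new
    return k12

x1 = jnp.array([1.0, 0.0, 1.0, 0.0])
x2 = jnp.array([1.0, 1.0, 0.0, 0.0])
print(nngp_kernel(x1, x2))
```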
(My guess is that the namespace collision is why I somehow missed this line of work; I had read up on the 2-layer training-dynamics branch of MFT, thought I had understood the relevant parts of MFT, and missed the signal-propagation branch entirely.)
I still think the strong version of “there will be a scientific theory of deep learning”, one that explains why SGD on overparameterized nets generalizes, why particular architectural choices work, and what particular features get learned, is far from established. I also think that the Zhang et al. and Nagarajan et al. results remain genuinely damning for the older PAC-Bayes / uniform-convergence approaches. I don't think anything in the MFT/TP literature addresses the core puzzles those papers raised (they address very different questions in very different regimes).
But a lot of my pessimism about deep learning theory came from feeling like there was not a coherent intellectual tradition that could point to concrete wins. Insofar as MFT (both the signal-propagation and training-dynamics branches) and Tensor Programs constitute such a tradition (as opposed to primarily the work of a single brilliant individual), there is at least one tradition in deep learning theory that has produced cumulative progress and made falsifiable predictions that have been confirmed in practice. That deserves more credit than I was giving the field.
Oops.
I sometimes run into bright young AI people with plenty of interest in math but not so much in engineering, who ask me what they should study. Beyond the very basics of deep learning (e.g. optimizers, basic RL theory), I used to give a shrug and say “Maybe computation in superposition? Maybe Singular Learning Theory?”. From now on I think I'll start my answer with "probably the Mean Field Theory and Tensor Programs work."
Yes, this was obvious in retrospect. As I say later in the post, oops.
Of course, there's a lot of earlier work on initializations (e.g., the Xavier and He initializations), most of which relied on 1) tracking the forward and backward passes, 2) heuristic calculations of the scale of various quantities, and 3) an independence assumption between parameters and gradients at initialization, and which was substantially less sophisticated than the MFT work. While the μP Tensor Programs paper also provides these heuristic calculations (allowing one to rederive μP from a toy model), it formalized these assumptions with tools from free probability and random matrix theory.
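For instance (my quick restatement of the standard heuristic): for a ReLU layer $y^l = W^l\,\mathrm{relu}(y^{l-1})$ with i.i.d. zero-mean weights and the usual independence assumptions,
$$\mathrm{Var}[y^l] = \tfrac{1}{2}\, n_{\mathrm{in}}\, \mathrm{Var}[W^l_{ij}]\; \mathrm{Var}[y^{l-1}],$$
so keeping the forward signal scale constant gives $\mathrm{Var}[W_{ij}] = 2/n_{\mathrm{in}}$ (He init); Xavier instead balances the forward and backward passes with $2/(n_{\mathrm{in}} + n_{\mathrm{out}})$.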
The closest work I'm aware of that touches on this is Rubin, Seroussi, and Ringel's Grokking as a First Order Phase Transition in Two Layer Networks.