I research ways in which the tools and perspectives of theoretical physics can be applied to artificial intelligence. In particular, I co-authored The Principles of Deep Learning Theory with Sho Yaida (also based on research in collaboration with Boris Hanin), which will be published by Cambridge University Press in early 2022. (https://deeplearningtheory.com)
I am a Research Affiliate at the Center for Theoretical Physics at MIT and also a Principal Researcher at Salesforce, having arrived via the acquisition of Diffeo, where I was Co-Founder and Chief Technology Officer. I am also an Affiliate of the NSF AI Institute for Artificial Intelligence and Fundamental Interactions (IAIFI).
Thank you for the comment! Let me reply to your specific points.
First, and as a TL;DR: whether the NTK parameterization is "right" or "wrong" is perhaps an issue of prescriptivism vs. descriptivism: regardless of which one is "better", the NTK parameterization is (close to what is) commonly used in practice, and so if you're interested in modeling what practitioners do, it's a very useful setting to study. Additionally, one disadvantage of maximal update parameterization from the point of view of interpretability is that it's in the strong-coupling regime, and many of the nice tools we use in our book, e.g., to write down the solution at the end of training, cannot be applied. So perhaps if your interest is safety, you'd be shooting yourself in the foot if you use maximal update parameterization! :)
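To unpack what "close to what is commonly used in practice" means here, below is a minimal numerical sketch (my own illustration, not something from the book; the layer sizes are arbitrary): at initialization, the NTK parameterization gives the same preactivation statistics as a standard LeCun-style initialization; the difference is only in where the 1/sqrt(fan-in) factor lives, which only starts to matter once the weights are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, batch = 512, 512, 4096
x = rng.normal(size=(batch, fan_in))

# Standard parameterization: the 1/fan_in variance is baked into the initialization.
W_std = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
z_std = x @ W_std

# NTK parameterization: unit-variance weights, with 1/sqrt(fan_in) in the forward pass.
W_ntk = rng.normal(scale=1.0, size=(fan_in, fan_out))
z_ntk = x @ W_ntk / np.sqrt(fan_in)

# At initialization both give preactivations of unit variance; the parameterizations
# only start to differ once gradient descent updates the weights.
print(z_std.var(), z_ntk.var())
```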
Second, it is a common misconception that the NTK parameterization cannot learn features and that maximal update parameterization is the only parameterization that does. As discussed in the post above, all networks in practice have finite width; the infinite-width limit is a formal idealization. At finite width, either parameterization learns features. Moreover, while it is true that the formal *infinite-width limit at fixed depth* doesn't learn features, you can also take a limit that scales up both depth and width together in which the NTK parameterization does learn features. Indeed, one of the main results of the book is that, for the NTK parameterization, the depth-to-width aspect ratio is the key hyperparameter that controls the theory describing how realistic networks behave.
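To make the finite-width point concrete, here's a minimal numerical sketch (again my own illustration, not from the book; the toy task, one-hidden-layer architecture, and hyperparameters are arbitrary choices): it trains an NTK-parameterized MLP with plain gradient descent and measures how much the hidden-layer features move away from their initialization. At any finite width the movement is nonzero, and it shrinks as the width grows, consistent with the frozen fixed-depth infinite-width limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_measure(width, steps=2000, lr=0.1):
    # Toy 1D regression task (arbitrary choice, purely for illustration).
    x = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)
    y = np.sin(3.0 * x)

    # NTK parameterization: O(1) weights, with 1/sqrt(fan_in) in the forward pass.
    W1 = rng.normal(size=(1, width))
    b1 = np.zeros(width)
    W2 = rng.normal(size=(width, 1))
    b2 = np.zeros(1)

    def forward(x):
        z1 = x @ W1 + b1                      # fan_in = 1, so no rescaling needed here
        h = np.tanh(z1)                       # hidden-layer features
        f = h @ W2 / np.sqrt(width) + b2      # readout rescaled by 1/sqrt(width)
        return z1, h, f

    _, h_init, _ = forward(x)                 # features at initialization

    for _ in range(steps):
        z1, h, f = forward(x)
        grad_f = 2.0 * (f - y) / len(x)       # gradient of the mean-squared error
        gW2 = h.T @ grad_f / np.sqrt(width)
        gb2 = grad_f.sum(axis=0)
        grad_h = grad_f @ W2.T / np.sqrt(width)
        grad_z1 = grad_h * (1.0 - np.tanh(z1) ** 2)
        gW1 = x.T @ grad_z1
        gb1 = grad_z1.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    _, h_final, _ = forward(x)
    # Relative change of the hidden-layer features over the course of training.
    return np.linalg.norm(h_final - h_init) / np.linalg.norm(h_init)

for width in (64, 256, 1024, 4096):
    print(f"width={width:5d}  relative feature change={train_and_measure(width):.4f}")
```

Running this, you should see a nonzero relative feature change at every width, decaying roughly like one over the square root of the width as the fixed-depth infinite-width limit is approached.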
Third, how to scale up hyperparameters is something that follows from understanding either parameterization, NTK or maximal update; a benefit of this kind of theory, from the practical perspective, is certainly learning how to correctly scale up to larger models.
Fourth, I agree that maximal update parameterization is also interesting to study, especially so if it becomes dominant among practitioners.
Finally, perhaps it's worth adding that the other author of the book (Sho) is posting a paper next week relating these two parameterizations. There, he finds that an entire one-parameter family of parameterizations -- interpolating between the NTK parameterization and the maximal update parameterization -- can learn features, if depth is scaled properly with width. (Edit: here's a link, https://arxiv.org/abs/2210.04909) Curiously, as mentioned in the first point above, the maximal update parameterization is in the strong-coupling regime, making it difficult to interpret with these theoretical tools. In terms of which parameterization is prescriptively better from a capabilities perspective, I think that remains an empirical question...
Sho and I want to thank jylin04 for this really nice post and endorse the distillation of our key results in her 8-page summary. We also agree that it would be interesting to make further connections between our work -- in particular the effective theory framework -- and interpretability, and we'd be really glad to explore and discuss that further.
I imagine that the Peekskill, New York location might be similar to Princeton, NJ in its setting, environment, and overall relationship to NYC. So it might be worth talking to people who've spent time at one of the universities or institutes in Princeton in order to understand the relative merits of such a setting and how they felt about the balance there between rural and urban.
(My disclosure is that I have spent time in such a setting and found it overly isolating, to the point of struggling to get any useful work completed there, and I ended up moving to live in NYC. However, from the content of this post, perhaps the MIRI staff have pretty different preferences than I do with regard to ideal living and research environments.)
Thanks for your summary of the book!
I think that the post and analysis are some evidence that it might be tractable to apply tools from the book directly to transformer architectures and LLMs.