All of danroberts's Comments + Replies

Thanks for your summary of the book! 

I think that the post and analysis are some evidence that it might be tractable to apply tools from the book directly to transformer architectures and LLMs.

Thank you for the comment! Let me reply to your specific points.

First, and TL;DR: whether the NTK parameterization is "right" or "wrong" is perhaps an issue of prescriptivism vs. descriptivism: regardless of which one is "better", the NTK parameterization is (close to what is) commonly used in practice, and so if you're interested in modeling what practitioners do, it's a very useful setting to study. Additionally, one disadvantage of maximal update parameterization from the point of view of interpretability is that it's in the strong-coupling regime...
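As a concrete illustration of where the two conventions differ, here is a minimal NumPy sketch of the usual forward-pass scalings (the widths and names are purely illustrative, not taken from the book or the comments): at initialization both conventions give preactivations of the same O(1) scale, and the difference only shows up in how per-parameter gradients, and hence sensible learning rates, scale with width.

```python
# Minimal sketch of the two conventions for a single width-n linear layer.
# (Widths and names are illustrative only.)
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 1024, 1024
x = rng.standard_normal(n_in)

# Standard parameterization: the 1/n factor lives in the weight variance.
W_std = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
z_std = W_std @ x

# NTK parameterization: unit-variance weights, explicit 1/sqrt(n) in the forward pass.
W_ntk = rng.standard_normal((n_out, n_in))
z_ntk = (W_ntk @ x) / np.sqrt(n_in)

# Same O(1) preactivation scale at initialization under either convention...
print(z_std.std(), z_ntk.std())

# ...but dz_i/dW_ij = x_j (standard) vs. x_j/sqrt(n) (NTK), so the
# per-parameter learning rate needed to keep updates O(1) scales
# differently with width -- which is where the parameterizations
# (and their training dynamics) genuinely diverge.
```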

Lucius Bushnaq
Aren't Standard Parametrisation and other parametrisations with a kernel limit commonly used mostly in cases where you're far away from reaching the depth-to-width ≈ 0 limit, so expansions like the one derived for the NTK parametrisation aren't very predictive anymore, unless you calculate infeasibly many terms in the expensive perturbative series? As far as I'm aware, when you're training really big models where the limit behaviour matters, you use parametrisations that don't get you too close to a kernel limit in the regime you're dealing with. Am I mistaken about that?

As for NTK being more predictable and therefore safer, it was my impression that it's more predictive the closer you are to the kernel limit, that is, the further away you are from doing the kind of representational learning AI Safety researchers like me are worried about. As I leave that limit behind, I've got to take into account ever higher order terms in the expansion, as I understand it. To me, that seems like the system is just getting more predictable in proportion to how much I'm crippling its learning capabilities.

Yes, of course NTK parametrisation and other parametrisations with a kernel limit can still learn features at finite width, I never doubted that. But it generally seems like adding more parameters means your system should work better, not worse, and if it's not doing that, it seems like the default assumption should be that you're screwing up. If it were the case that there's no parametrisation in which you can avoid converging to a trivial limit as you heap more parameters onto the width of an MLP, that would be one thing, and I think it'd mean we'd have learned something fundamental and significant about MLP architectures. But if it's only a certain class of parametrisations, and other parametrisations seem to deal with you piling on more parameters just fine, both in theory and in practice, my conclusion would be that what you're seeing is just a result of choosing a param...
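One rough way to see the "lazy" behaviour being described here is to track how far the hidden features of an NTK-parameterized network move from their values at initialization as the width grows. The PyTorch sketch below does this for a toy regression problem; the data, widths, step count, and learning rate are arbitrary choices for illustration, not anything from the exchange above.

```python
# Rough diagnostic: under the NTK parameterization with a width-independent
# learning rate, the relative movement of hidden features during training
# shrinks as the width n grows (the network approaches its kernel limit).
# Data, widths, step count, and learning rate are arbitrary illustrative choices.
import torch

torch.manual_seed(0)
d_in, n_steps, lr = 32, 200, 0.1
X = torch.randn(256, d_in)
y = torch.randn(256, 1)

for n in [64, 256, 1024, 4096]:
    # NTK parameterization: unit-variance weights, explicit 1/sqrt(fan_in) factors.
    W1 = torch.randn(d_in, n, requires_grad=True)
    W2 = torch.randn(n, 1, requires_grad=True)

    def features(W1):
        return torch.relu(X @ W1 / d_in ** 0.5)  # first-layer features

    h0 = features(W1).detach()  # features at initialization
    for _ in range(n_steps):
        loss = (((features(W1) @ W2) / n ** 0.5 - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            W1 -= lr * W1.grad
            W2 -= lr * W2.grad
        W1.grad = None
        W2.grad = None

    drift = (features(W1).detach() - h0).norm() / h0.norm()
    print(f"width {n:5d}: relative feature movement ~ {drift:.3f}")
```

The printed drift should shrink roughly like 1/sqrt(n) as the width grows, which is the sense in which the kernel limit trades feature learning for predictability.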

Sho and I want to thank jylin04 for this really nice post and endorse the distillation of our key results in her 8-page summary. We also agree that it would be interesting to make further connections between our work -- in particular the effective theory framework -- and interpretability, and we'd be really glad to explore and discuss that further.

I imagine that the Peekskill, New York location might be similar to Princeton, NJ in setting, environment, and overall relationship to NYC. So it might be worth talking to people who've spent time at one of the universities or institutes in Princeton to understand the relative merits of such a setting and how they felt about the balance there between rural and urban.

(My disclosure is that I have spent time in such a setting and found it overly isolating, to the point of struggling to get any useful work completed there, and ended up moving t...