This post is a follow-up to The Plan - 2023 Version. There’s also The Plan - 2022 Update and The Plan, but the 2023 version contains everything you need to know about the current Plan. Also see this comment and this comment on how my plans interact with the labs and other players, if you’re curious about that part.
What Have You Been Up To This Past Year?
Our big thing at the end of 2023 was Natural Latents. Prior to natural latents, the biggest problem with my math on natural abstraction was that it didn’t handle approximation well. Natural latents basically solved that problem. With that theoretical barrier out of the way, it was time to focus on crossing the theory-practice gap. Ultimately, that means building a product to get feedback from users on how well our theory works in practice, providing an empirical engine for iterative improvement of the theory.
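For readers who haven’t read the natural latents posts: as a rough paraphrase (not the exact formal statements), a latent Λ over two chunks of data X₁, X₂ is natural when two conditions hold, and the approximation story comes from letting each condition hold only up to an ε of KL divergence rather than exactly:

```latex
% Rough paraphrase of the naturality conditions for a latent $\Lambda$ over two
% chunks of data $X_1, X_2$; see the natural latents writeups for the exact
% statements and for how the epsilons propagate through the theorems.

% Mediation: X_1 and X_2 are approximately independent given Lambda.
\[
D_{KL}\big(\, P[X_1, X_2, \Lambda] \;\big\|\; P[\Lambda]\, P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda] \,\big) \;\le\; \epsilon
\]

% Redundancy: Lambda is approximately computable from X_1 alone
% (and likewise from X_2 alone).
\[
D_{KL}\big(\, P[X_1, X_2, \Lambda] \;\big\|\; P[X_1, X_2]\, P[\Lambda \mid X_1] \,\big) \;\le\; \epsilon
\]
```

The writeups track how those epsilons carry through the theorems, which is the approximation-handling piece that was previously missing.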
In late 2023 and early 2024, David and I spent about 3-4 months trying to speedrun the theory-practice gap. Our target product was an image editor; the idea was to use a standard image generation net (specifically this one), and edit natural latent variables internal to the net. It’s conceptually similar to some things people have built before, but the hope would be that natural latents would better match human concepts, and therefore the edits would feel more like directly changing human-interpretable things in the image in natural ways.
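Mechanically, “edit natural latent variables internal to the net” just means: grab an intermediate activation during the forward pass, nudge it along some chosen direction, and let the rest of the net rerun. Here’s a minimal hypothetical sketch of that operation (PyTorch-style; the model, layer name, and direction are placeholders, and all the real work is in finding directions which actually correspond to natural latents):

```python
# Hypothetical sketch of "edit an internal latent, then re-run the rest of the net".
# The generator, layer name, and edit direction are placeholders, not project code.
import torch

def edit_internal_latent(generator, z, layer_name, direction, strength=3.0):
    """Run the generator on noise `z`, but add `strength * direction`
    to the activations of `layer_name` mid-forward-pass."""
    def hook(module, inputs, output):
        # Nudge the intermediate representation along the chosen direction.
        return output + strength * direction.to(output.device)

    layer = dict(generator.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            edited_image = generator(z)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass errors
    return edited_image
```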
When I say “speedrun” the theory-practice gap… well, the standard expectation is that there’s a lot of iteration and insight required to get theory working in practice (even when the theory is basically correct). The “speedrun” strategy was to just try the easiest and hackiest thing at every turn. The hope was that (a) maybe it would turn out to be that easy (though probably not), and (b) even if it didn’t work, we’d get some useful feedback. After 3-4 months, it indeed did not work very well. But more importantly, we did not actually get much useful feedback signal. David and I now think the project was a pretty major mistake; it cost us 3-4 months and we got very little out of it.
After that, we spent a few months on some smaller and more theory-ish projects. We worked out a couple more pieces of the math of natural latents, explained what kind of model of semantics we’d ideally like (in terms of natural latents), wrote up a toy coherence theorem which I think is currently the best illustration of how coherence theorems should work, worked out a version of natural latents for Solomonoff inductors[1] and applied that to semantics as well, presented an interesting notion of corrigibility and tool-ness, and put together an agent model which resolved all of my own most pressing outstanding confusions about the type-signature of human values. There were also a few other results which we haven’t yet written up, including a version of the second law of thermo more suitable for embedded agents, and some more improvements to the theory of natural latents, as well as a bunch of small investigations which didn’t yield anything legible.
Of particular note, we spent several weeks trying to apply the theory of natural latents to fluid mechanics. That project has not yet yielded anything notable, but it’s of interest here because it’s another plausible route to a useful product: a fluid simulation engine based on natural latent theory would, ideally, make all of today’s fluid simulators completely obsolete, and totally change the accuracy/compute trade-off curves. To frame it in simulation terms, the ideal version of this would largely solve the challenges of multiscale simulation, i.e. eliminate the need for a human to figure out relevant summary statistics and hand-code multiple levels. Of course that project has its own nontrivial theory-practice gap to cross.
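For concreteness, the kind of hand-coded summary statistic that today’s multiscale methods rely on looks something like the toy below (simple block-averaging of a velocity field); the hope is that a natural-latents-based simulator would derive the right summaries itself rather than having a human pick them. This is purely illustrative, not anything from our project:

```python
# Toy illustration of hand-coded multiscale coarse-graining: a human has decided
# that block-averaged velocity is the relevant summary statistic for the coarse
# level. A natural-latents-based simulator would ideally derive such summaries.
import numpy as np

def coarse_grain_velocity(v, block=4):
    """Average a 2D velocity field over `block` x `block` cells.

    v: array of shape (H, W, 2) with H and W divisible by `block`.
    Returns an array of shape (H // block, W // block, 2).
    """
    H, W, _ = v.shape
    return v.reshape(H // block, block, W // block, block, 2).mean(axis=(1, 3))

# Example: a 64x64 fine grid reduced to a 16x16 coarse grid.
fine = np.random.randn(64, 64, 2)
coarse = coarse_grain_velocity(fine, block=4)
print(coarse.shape)  # (16, 16, 2)
```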
At the moment, we’re focused on another project with an image generator net, about which we might write more in the future.
Why The Focus On Image Generators Rather Than LLMs?
At this stage, we’re not really interested in the internals of nets themselves. Rather, we’re interested in what kinds of patterns in the environment the net learns and represents. Roughly speaking, one can’t say anything useful about representations in a net until one has a decent characterization of the types of patterns in the environment which are represented in the first place.[2]
And for that purpose, we want to start as “close to the metal” as possible. We definitely do not want our lowest-level data to be symbolic strings, which are themselves already high-level representations far removed from the environment we’re trying to understand.
And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we focus on image nets. The field’s current focus on LLMs is a mistake.
Any Major Changes To The Plan In The Past Year?
In previous years, much of my relative optimism stemmed from the hope that the field of alignment would soon shift from pre-paradigmatic to paradigmatic, and progress would accelerate a lot as a result. I’ve largely given up on that hope. The probability I assign to a good outcome has gone down accordingly; I don’t have a very firm number, but it’s definitely below 50% now.
In terms of the plan, we’ve shifted toward assuming we’ll need to do more of the work ourselves. Insofar as we’re relying on other people to contribute, we expect it to be a narrower set of people on narrower projects.
This is not as dire an update as it might sound. The results we already have are far beyond what I-in-2020 would have expected from just myself and one other person, especially with the empirical feedback engine not really up and running yet. Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track. And that kind of productivity multiplier is not out of the question; I already estimate that working with David has been about a 3x boost for me, so we’d need roughly that much again. Especially if we get the empirical feedback loop up and running, another 3-4x is very plausible. Not easy, but plausible.
Do We Have Enough Time?
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), possibly on top of the next generation of base models. My median guess is that they won’t, and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
[1] Woohoo! I’d been wanting a Solomonoff version of natural abstraction theory for years.

[2] The lack of understanding of the structure of patterns in the environment is a major barrier for interp work today. The cutting edge is “sparse features”, which is indeed a pattern which comes up a lot in our environment, but it’s probably far from a complete catalogue of the relevant types of patterns.
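(For concreteness, “sparse features” here refers to the kind of decomposition used in sparse autoencoder work: activations modeled as sparse nonnegative combinations of learned dictionary directions, trained with a reconstruction-plus-L1 objective, roughly as sketched below. Names and hyperparameters are illustrative, not from any particular codebase.)

```python
# Illustrative sketch of the "sparse features" assumption: activations are modeled
# as sparse nonnegative combinations of learned dictionary directions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse nonnegative feature activations
        recon = self.decoder(codes)             # reconstruction from the dictionary
        return recon, codes

def sae_loss(recon, acts, codes, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature
    # activations to zero.
    return ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()
```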