This post is a follow-up to The Plan - 2023 Version. There’s also The Plan - 2022 Update and The Plan, but the 2023 version contains everything you need to know about the current Plan. Also see this comment and this comment on how my plans interact with the labs and other players, if you’re curious about that part.
What Have You Been Up To This Past Year?
Our big thing at the end of 2023 was Natural Latents. Prior to natural latents, the biggest problem with my math on natural abstraction was that it didn’t handle approximation well. Natural latents basically solved that problem. With that theoretical barrier out of the way, it was time to focus on crossing the theory-practice gap. Ultimately, that means building a product to get feedback from users on how well our theory works in practice, providing an empirical engine for iterative improvement of the theory.
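For readers who haven’t read the natural latents posts: as a rough paraphrase (not the exact formal statements), a latent Λ over two chunks of data X₁, X₂ is natural when two conditions hold, and the approximation story comes from letting each condition hold only up to an ε of KL divergence rather than exactly:

```latex
% Rough paraphrase of the naturality conditions for a latent $\Lambda$ over two
% chunks of data $X_1, X_2$; see the natural latents writeups for the exact
% statements and for how the epsilons propagate through the theorems.

% Mediation: X_1 and X_2 are approximately independent given Lambda.
\[
D_{KL}\big(\, P[X_1, X_2, \Lambda] \;\big\|\; P[\Lambda]\, P[X_1 \mid \Lambda]\, P[X_2 \mid \Lambda] \,\big) \;\le\; \epsilon
\]

% Redundancy: Lambda is approximately computable from X_1 alone
% (and likewise from X_2 alone).
\[
D_{KL}\big(\, P[X_1, X_2, \Lambda] \;\big\|\; P[X_1, X_2]\, P[\Lambda \mid X_1] \,\big) \;\le\; \epsilon
\]
```

The writeups track how those epsilons carry through the theorems, which is the approximation-handling piece that was previously missing.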
In late 2023 and early 2024, David and I spent about 3-4 months trying to speedrun the theory-practice gap. Our target product was an image editor; the idea was to use a standard image generation net (specifically this one), and edit natural latent variables internal to the net. It’s conceptually similar to some things people have built before, but the hope would be that natural latents would better match human concepts, and therefore the edits would feel more like directly changing human-interpretable things in the image in natural ways.
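Mechanically, “edit natural latent variables internal to the net” just means: grab an intermediate activation during the forward pass, nudge it along some chosen direction, and let the rest of the net rerun. Here’s a minimal hypothetical sketch of that operation (PyTorch-style; the model, layer name, and direction are placeholders, and all the real work is in finding directions which actually correspond to natural latents):

```python
# Hypothetical sketch of "edit an internal latent, then re-run the rest of the net".
# The generator, layer name, and edit direction are placeholders, not project code.
import torch

def edit_internal_latent(generator, z, layer_name, direction, strength=3.0):
    """Run the generator on noise `z`, but add `strength * direction`
    to the activations of `layer_name` mid-forward-pass."""
    def hook(module, inputs, output):
        # Nudge the intermediate representation along the chosen direction.
        return output + strength * direction.to(output.device)

    layer = dict(generator.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            edited_image = generator(z)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass errors
    return edited_image
```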
When I say “speedrun” the theory-practice gap… well, the standard expectation is that there’s a lot of iteration and insight required to get theory working in practice (even when the theory is basically correct). The “speedrun” strategy was to just try the easiest and hackiest thing at every turn. The hope was that (a) maybe it would turn out to be that easy (though probably not), and (b) even if it didn’t work, we’d get some useful feedback. After 3-4 months, it indeed did not work very well. But more importantly, we did not actually get much useful feedback signal. David and I now think the project was a pretty major mistake; it cost us 3-4 months and we got very little out of it.
After that, we spent a few months on some smaller and more theory-ish projects. We worked out a couple more pieces of the math of natural latents, explained what kind of model of semantics we’d ideally like (in terms of natural latents), wrote up a toy coherence theorem which I think is currently the best illustration of how coherence theorems should work, worked out a version of natural latents for Solomonoff inductors[1] and applied that to semantics as well, presented an interesting notion of corrigibility and tool-ness, and put together an agent model which resolved all of my own most pressing outstanding confusions about the type-signature of human values. There were also a few other results which we haven’t yet written up, including a version of the second law of thermo more suitable for embedded agents, and some more improvements to the theory of natural latents, as well as a bunch of small investigations which didn’t yield anything legible.
Of particular note, we spent several weeks trying to apply the theory of natural latents to fluid mechanics. That project has not yet yielded anything notable, but it’s of interest here because it’s another plausible route to a useful product: a fluid simulation engine based on natural latent theory would, ideally, make all of today’s fluid simulators completely obsolete, and totally change the accuracy/compute trade-off curves. To frame it in simulation terms, the ideal version of this would largely solve the challenges of multiscale simulation, i.e. eliminate the need for a human to figure out relevant summary statistics and hand-code multiple levels. Of course that project has its own nontrivial theory-practice gap to cross.
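For concreteness, the kind of hand-coded summary statistic that today’s multiscale methods rely on looks something like the toy below (simple block-averaging of a velocity field); the hope is that a natural-latents-based simulator would derive the right summaries itself rather than having a human pick them. This is purely illustrative, not anything from our project:

```python
# Toy illustration of hand-coded multiscale coarse-graining: a human has decided
# that block-averaged velocity is the relevant summary statistic for the coarse
# level. A natural-latents-based simulator would ideally derive such summaries.
import numpy as np

def coarse_grain_velocity(v, block=4):
    """Average a 2D velocity field over `block` x `block` cells.

    v: array of shape (H, W, 2) with H and W divisible by `block`.
    Returns an array of shape (H // block, W // block, 2).
    """
    H, W, _ = v.shape
    return v.reshape(H // block, block, W // block, block, 2).mean(axis=(1, 3))

# Example: a 64x64 fine grid reduced to a 16x16 coarse grid.
fine = np.random.randn(64, 64, 2)
coarse = coarse_grain_velocity(fine, block=4)
print(coarse.shape)  # (16, 16, 2)
```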
At the moment, we’re focused on another project with an image generator net, about which we might write more in the future.
Why The Focus On Image Generators Rather Than LLMs?
At this stage, we’re not really interested in the internals of nets themselves. Rather, we’re interested in what kinds of patterns in the environment the net learns and represents. Roughly speaking, one can’t say anything useful about representations in a net until one has a decent characterization of the types of patterns in the environment which are represented in the first place.[2]
And for that purpose, we want to start as “close to the metal” as possible. We definitely do not want our lowest-level data to be symbolic strings, which are themselves already high-level representations far removed from the environment we’re trying to understand.
And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we focus on image nets. The field’s current focus on LLMs is a mistake.
Any Major Changes To The Plan In The Past Year?
In previous years, much of my relative optimism stemmed from the hope that the field of alignment would soon shift from pre-paradigmatic to paradigmatic, and progress would accelerate a lot as a result. I’ve largely given up on that hope. The probability I assign to a good outcome has gone down accordingly; I don’t have a very firm number, but it’s definitely below 50% now.
In terms of the plan, we’ve shifted toward assuming we’ll need to do more of the work ourselves. Insofar as we’re relying on other people to contribute, we expect it to be a narrower set of people on narrower projects.
This is not as dire an update as it might sound. The results we already have are far beyond what I-in-2020 would have expected from just myself and one other person, especially with the empirical feedback engine not really up and running yet. Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track. And that kind of productivity multiplier is not out of the question; I already estimate that working with David has been about a 3x boost for me, so we’d need roughly that much again. Especially if we get the empirical feedback loop up and running, another 3-4x is very plausible. Not easy, but plausible.
Do We Have Enough Time?
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), possibly on top of the next generation of base models. My median guess is that they won’t, and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
[1] Woohoo! I’d been wanting a Solomonoff version of natural abstraction theory for years.

[2] The lack of understanding of the structure of patterns in the environment is a major barrier for interp work today. The cutting edge is “sparse features”, which is indeed a pattern which comes up a lot in our environment, but it’s probably far from a complete catalogue of the relevant types of patterns.
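(For concreteness, “sparse features” here refers to the kind of decomposition used in sparse autoencoder work: activations modeled as sparse nonnegative combinations of learned dictionary directions, trained with a reconstruction-plus-L1 objective, roughly as sketched below. Names and hyperparameters are illustrative, not from any particular codebase.)

```python
# Illustrative sketch of the "sparse features" assumption: activations are modeled
# as sparse nonnegative combinations of learned dictionary directions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse nonnegative feature activations
        recon = self.decoder(codes)             # reconstruction from the dictionary
        return recon, codes

def sae_loss(recon, acts, codes, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature
    # activations to zero.
    return ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()
```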