Both parties take some time to reach a mutual understanding of what the architecture is, or at least to build some intuition about what kind of architecture the other has in mind and what changes would be made to it. This is a pattern I have seen a few times, and I have heard it described as typical of pre-paradigmatic fields: there is not enough standard terminology for common patterns or constraints, and not enough concrete implementations to point to or write down quickly. It would have been much easier if johnswentworth had been able to refer to a "Shah representation of the concept of human value" or a "2nd-order Byrnes search process neuron hull" (all made up) and say "let's Yud-chain the Byrnes hull to the Shah representation". Then the discussion would have been about whether the 2nd-order Byrnes hull converges faster than the Yud-chain diverges, or something like that. But we are not there yet, and it shows.
John mentioned the existence of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?, which is something of a follow-up post to How To Go From Interpretability To Alignment: Just Retarget The Search and continues in a similar direction.
In How To Go From Interpretability To Alignment: Just Retarget The Search, John Wentworth suggests:
There was a pretty interesting thread in the comments afterwards that I wanted to highlight.
Rohin Shah (permalink)
Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering.
I like what you call "complicated schemes" over "retarget the search" for two main reasons:
johnswentworth (permalink)
I indeed think those are the relevant cruxes.
Evan R. Murphy (permalink)
Why do you think we probably won't end up with mesa-optimizers in the systems we care about?
Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.
Rohin Shah (permalink)
I'm not thinking much about whether we're considering generative models vs RL-based agents for this particular question (though generally I tend to think about foundation models finetuned from human feedback).
johnswentworth (permalink)
I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don't think you're young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word "search" refers to?
Rohin Shah (permalink)
I've definitely seen A* search and know how it works. I meant to allude to it (and lots of other algorithms that involve a clear goal) with this part:
If your AGI is doing an A* search, then I think "retarget the search" is not a great strategy, because you have to change both the goal specification and the heuristic, and it's really unclear how you would change the heuristic even given a solution to outer alignment (because A* heuristics are incredibly specific to the setting and goal, and presumably have to become way more specialized than they are today in order for it to be more powerful than what we can do today).
johnswentworth (permalink)
That's what relaxation-based methods are for; they automatically generate heuristics for A* search. For instance, in a maze, it's very easy to find a solution if we relax all the "can't cross this wall" constraints, and that yields the Euclidean distance heuristic. Also, to a large extent, those heuristics tend to depend on the environment but not on the goal - for instance, in the case of Euclidean distance in a maze, the heuristic applies to any pathfinding problem between two points (and probably many other problems too), not just to whatever particular start and end points the particular maze highlights. We can also view things like instrumentally convergent subgoals or natural abstractions as likely environment-specific (but not goal-specific) heuristics.
Those are the sort of pieces I imagine showing up as part of "general-purpose search" in trained systems: general methods for generating heuristics for a wide variety of goals, as well as some hard-coded environment-specific (but not goal-specific) heuristics.
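To make the relaxation idea concrete, here is a minimal sketch (my own illustration, not code from either post): an A* search over a grid maze whose heuristic comes from relaxing the "can't cross this wall" constraints. On a 4-connected grid the relaxed problem is solved exactly by the Manhattan distance (the grid analogue of John's Euclidean-distance example), and the resulting heuristic depends only on the environment's geometry, not on the particular start and goal the maze happens to highlight.

```python
import heapq

# Minimal A* on a grid maze. The heuristic is derived by relaxation:
# drop the wall constraints, and the relaxed problem's exact solution
# (Manhattan distance on a 4-connected grid) becomes the heuristic.

def manhattan(a, b):
    """Heuristic from the relaxed (wall-free) problem."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(walls, start, goal, width, height):
    """Return the length of a shortest path, or None if unreachable."""
    frontier = [(manhattan(start, goal), 0, start)]
    best_cost = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dx, node[1] + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None

# The same heuristic works for any start/goal pair in this environment:
print(astar(walls={(1, 0), (1, 1), (1, 2)}, start=(0, 0), goal=(3, 3), width=4, height=4))
```

The point of the sketch is that `manhattan` was generated mechanically from the relaxed problem and can be reused for any pathfinding goal in this environment, which is the sense in which relaxation-derived heuristics are environment-specific rather than goal-specific.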
Rohin Shah (permalink)
(Note to readers: here's another post (with comments from John) on the same topic, which I only just saw.)
I imagine two different kinds of AI systems you might be imagining:
In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specification, and heuristics, where it is not clear that "retarget the search" would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)
In (2), I don't know why you expect to get general-purpose search instead of a very complex heuristic that's very specific to the mesa objective. There is only ever one goal that the A* search has to optimize for; why wouldn't gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don't exist?
Separately: do you think we could easily "retarget the search" for an adult human, if we had mechanistic interpretability + edit access for the human's brain? I'd expect "no".
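As an editorial aside, here is a toy sketch of what an architecture like (1) could look like (my own illustration; the number puzzle and function names are made up): a generic best-first search subroutine that takes the goal test and heuristic as inputs, plus a separate controller part that formulates them for each task.

```python
import heapq

# Hypothetical toy version of architecture (1): a generic search subroutine
# plus a separate "controller" that formulates the problem spec (goal test,
# successor function) and a heuristic, then calls the subroutine.

def best_first_search(start, is_goal, successors, heuristic, max_steps=10_000):
    """Generic search subroutine: knows nothing about the specific task."""
    frontier = [(heuristic(start), start, [start])]
    seen = {start}
    for _ in range(max_steps):
        if not frontier:
            return None
        _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt, path + [nxt]))
    return None

def controller(task_target: int):
    """The part that formulates the problem spec and heuristic for each task."""
    is_goal = lambda n: n == task_target
    successors = lambda n: [n + 1, n * 2]            # allowed moves: +1 or *2
    heuristic = lambda n: abs(task_target - n)       # task-specific heuristic
    return best_first_search(0, is_goal, successors, heuristic)

print(controller(37))   # a sequence of +1 / *2 states from 0 to 37
```

In this framing, retargeting `best_first_search` alone does nothing by itself: the goal test and heuristic it receives are still chosen by `controller`, which is where Rohin expects the hard alignment work to live.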
johnswentworth (permalink)
I'm imagining roughly (1), though with some caveats:
I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That's basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.
I expect basically "yes", although the result would be something quite different from a human.
We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do it. I'm currently working on a post on this, and my opening example is Benito's job; here are some things he's had to do over the past couple years:
So there's clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.
That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they're unsure what to point the retargetable search process at. If we were to hardwire a human's search process to a particular target, they'd single-mindedly pursue that one target (and subgoals thereof); that's quite different from normal humans.
Rohin Shah (permalink)
... Interesting. I've been thinking we were talking about (2) this entire time, since on my understanding of "mesa optimizers", (1) is not a mesa optimizer (what would its mesa objective be?).
If we're imagining systems that look more like (1) I'm a lot more confused about how "retarget the search" is supposed to work. There's clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly -- is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of "human values" (or "user intent" or whatever)? If that sort of thing doesn't hamstring the AI, why didn't gradient descent do the same thing, except replacing it with a hardcoded concept of "reward" (which presumably a somewhat smart AGI would have)?
johnswentworth (permalink)
So, part of the reason we expect a retargetable search process in the first place is that it's useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the "outermost call"; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
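A toy illustration of that structure (again my own sketch, with made-up names, not John's code): a search procedure that recursively calls itself on the subgoals it generates, where "retargeting the search" means swapping out only the goal passed to the outermost call; the recursive calls keep choosing their own subgoals while solving it.

```python
# Toy illustration (hypothetical names): a recursive search process where
# retargeting only swaps the goal of the outermost call. Inner recursive
# calls still pick their own subgoals while solving that goal.

def decompose(goal, depth):
    """Stand-in for the system's own subgoal generator."""
    if depth >= 2:          # base case: treat the goal as directly solvable
        return []
    return [f"{goal}/subgoal-{i}" for i in range(2)]

def solve_directly(goal):
    """Stand-in for whatever primitive actions achieve a leaf goal."""
    return [f"do({goal})"]

def search(goal, depth=0):
    """Recursive general-purpose search: solve a goal via its subgoals."""
    subgoals = decompose(goal, depth)
    if not subgoals:
        return solve_directly(goal)
    plan = []
    for sub in subgoals:
        plan.extend(search(sub, depth + 1))   # recursive calls are not retargeted
    return plan

# "Retarget the search" = change only the goal of the outermost call:
print(search("original mesa-objective"))
print(search("human values"))   # same machinery, different top-level target
```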
Rohin Shah (permalink)
Okay, I think this is a plausible architecture that a learned program could have, and I don't see super strong reasons for "retarget the search" to fail on this particular architecture (though I do expect that if you flesh it out you'll run into more problems, e.g. I'm not clear on where "concepts" live in this architecture and I could imagine that poses problems for retargeting the search).
Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human's search). But I agree that my reason (2) above doesn't clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn't thinking about when I wrote my original comment.
Moderator note: this post is an experiment in promoting more dialogue-shaped content on LessWrong. If you’d be excited about finding an interlocutor to debate, dialogue, or be interviewed by: fill in this dialogue matchmaking form.