[EDIT: Many people who read this post were very confused about some things, which I later explained in What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? You might want to read that post first.]
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.
In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.
We’ll need to make two assumptions:
- Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
- The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
Given these two assumptions, here’s how to use interpretability tools to align the AI:
- Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
- Identify the retargetable internal search process.
- Retarget the internal search process (i.e. directly rewire/set its input state) so that it points at the internal representation of our alignment target.
Just retarget the search. Bada-bing, bada-boom.
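To make the shape of that procedure concrete, here is a purely schematic sketch. Every name in it (ConceptHandle, SearchProcess, find_concept, find_search) is hypothetical and stands in for interpretability tooling that does not yet exist; the point is only the three-step structure, not a workable implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-ins for what sufficiently good interpretability tools
# would hand us. None of this is real tooling; it just mirrors the three
# steps listed above.

@dataclass
class ConceptHandle:
    """A pointer to some internal representation the tools located."""
    name: str
    representation: object  # e.g. a direction/region in activation space

@dataclass
class SearchProcess:
    """The retargetable general-purpose search the mesa-optimization argument predicts."""
    target: Optional[ConceptHandle] = None

    def set_target(self, concept: ConceptHandle) -> None:
        # "Retarget": rewire the search so its input target slot points here.
        self.target = concept

def just_retarget_the_search(
    find_concept: Callable[[str], ConceptHandle],  # step 1: interpretability tool
    find_search: Callable[[], SearchProcess],      # step 2: interpretability tool
    alignment_target: str,                         # e.g. "what the user intends"
) -> SearchProcess:
    target = find_concept(alignment_target)  # identify the internal concept
    search = find_search()                   # identify the retargetable search
    search.set_target(target)                # step 3: retarget it
    return search
```

All of the actual difficulty lives inside find_concept and find_search; the sketch just makes explicit how little else the proposal involves.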
Problems
Of course, as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or “human mimicry” are more likely than “human values”). But those details seem relatively straightforward.
A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well, otherwise our target may fall apart.
Then, of course, there are the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).
Upsides
First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.
Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.
But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, moving it around and hiding it in different black boxes but never actually eliminating it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, and it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.
As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.
I definitely buy this and I think the thread under this between you and John is a useful elaboration.
The thing that generates the proposals has to do most of the heavy lifting in any interestingly-large problem. For example, I would argue most of the heavy lifting[1] in AlphaGo and that crowd is done by the fact that the atomic actions are all already 'interestingly consequential' (i.e. the proposal generation doesn't have to consider millisecond muscle twitches, but rather whole 'moves', a short string of which is genuinely consequential in-context).
Nevertheless I reasonably strongly think that something of the 'retargetable search' flavour is a useful thing to expect, look for, and attempt to control.
For one, once you have proposals which are any kind of good at all, running a couple of OOMs of plan selection can buy you a few standard deviations of plan quality, provided you can evaluate plans ex ante better than randomly, which is just generically applicable and useful. But this isn't the main thing, because with just that picture we're still back to most of the action being the generator/heuristics.
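As a rough sanity check on the “couple of OOMs buys a few standard deviations” claim: if plan quality were (purely for illustration) standard-normal across proposals and the evaluator could reliably pick the best of N, the gain is roughly the expected maximum of N draws. A toy Monte Carlo sketch, with made-up numbers rather than anyone’s actual model of plan quality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: plan quality ~ N(0, 1) across proposals, and the evaluator can
# pick the best of N ex ante. The gain from more search is roughly the
# expected maximum of N standard normals.
for n in [10, 100, 1_000, 10_000]:
    samples = rng.standard_normal((1_000, n))  # 1,000 trials of best-of-n
    gain = samples.max(axis=1).mean()
    print(f"best-of-{n:>6}: ~{gain:.2f} SD above an average plan")
```

Even four orders of magnitude of proposals, with a perfect evaluator, only buys roughly four standard deviations here, which is why most of the action is still in the generator/heuristics.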
The main things are that
[1] (This is an entirely unfair defamation of Silver et al., which I feel the need to qualify is at least partly rhetorical and not in fact my entire take on the matter.)