why don’t we:
Step 1: elicit the model’s latent knowledge of what humans really want, …
Step 2: repackage that knowledge into a utility function, …
Step 3: and plug that utility function into an RL agent?
(Don't get me wrong, this would be great progress, but I don't think it's quite at the level of "completely solves the problem".)
Thanks, I should have clarified that everywhere I say "alignment" in this post, I'm really talking about (outer) intent alignment, which of course excludes a whole barrage of safety-relevant concerns: safe exploration, robustness to distributional shift, mesa-optimizers, etc.
That said, I think the particular concern expressed in the paper you link -- namely, that the agent's reward model could break OOD while the agent's capabilities remain otherwise intact -- doesn't seem like it would be an issue here? Indeed, the agent's reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
(Also, I feel like I ought to reiterate that I don't actually expect the 3-step plan quoted to work, due to the concerns that I brought up later in the post about narrow vs. non-narrow elicitation. Rather, I included it as some sort of aspirational pipe dream about what we theoretically could achieve if we could do ELK to elicit arbitrary knowledge (which, IMO, probably isn't possible). My point was that it feels like this approach captures the "general thrust" of ELK: to actually use the safety-relevant knowledge present in a capable predictor's world model (rather than letting it sit impotently inside of the world model, useful only for making predictions).)
Fair enough if you just want to talk about outer alignment.
That said, I think the particular concern expressed in the paper you link -- namely, that the agent's reward model could break OOD while the agent's capabilities remain otherwise intact -- doesn't seem like it would be an issue here? Indeed, the agent's reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
I agree that this implies that the utility function you get in Step 2 will be good and will continue working OOD.
I assumed that in Step 3, you would plug that utility function as the reward function into an algorithm like PPO in order to train a policy that acted well. The issue is then that the resulting policy could end up optimizing for something else OOD, even if the utility function would have done the right thing, in the same way that the CoinRun policy ends up always going to the end of the level even though it was trained on the desired reward function of "+10 if you get the coin, 0 otherwise".
Maybe you have some different Step 3 in mind besides "run PPO"?
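To make the CoinRun point above concrete, here is a minimal toy sketch. It is not the real CoinRun environment, and nothing here is trained with PPO. The corridor layout, `run_episode`, `go_right`, and `seek_coin` are all hypothetical names made up for illustration. The two hand-written policies stand in for two things a learned policy might have ended up doing: both earn full reward when the coin always sits at the right end, and only one still does once the coin moves.

```python
# Toy 1-D corridor, loosely inspired by the CoinRun example (an assumption-laden
# sketch, not the actual environment). The reward function is the intended one:
# +10 if the agent reaches the coin, 0 otherwise.

def run_episode(policy, coin_pos, start=5, length=11, max_steps=20):
    """Roll out `policy` and return the episode's total reward."""
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return 10  # reached the coin: the correct reward is paid out
        step = policy(pos, coin_pos)
        pos = max(0, min(length - 1, pos + step))  # stay inside the corridor
    return 10 if pos == coin_pos else 0


def go_right(pos, coin_pos):
    """Proxy behavior: ignore the coin and always head for the end of the level."""
    return 1


def seek_coin(pos, coin_pos):
    """Intended behavior: move toward wherever the coin actually is."""
    return 1 if coin_pos > pos else -1


# "Training distribution": the coin is always at the right end (cell 10).
train_scores = (run_episode(go_right, 10), run_episode(seek_coin, 10))

# "Out of distribution": the coin is at the left end (cell 0).
test_scores = (run_episode(go_right, 0), run_episode(seek_coin, 0))

print(train_scores)  # (10, 10): the reward cannot tell the two policies apart
print(test_scores)   # (0, 10): same correct reward function, different behavior
```

The reward function is identical and correct in both settings; what differs is which behavior the policy happened to pick up while the two were indistinguishable.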
Thanks, this is indeed a point I hadn't fully appreciated: even if a reward function generalizes well OOD, that doesn't mean that a policy trained on that reward function does.
It seems like the issue here is that it's a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
I agree that one major mitigation is to keep training your policy online, but that doesn't necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning "I'll behave well until the moment I strike", and your reward function can't detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.
Writing suggestion: Expand the acronym "ELK" early in the piece. When I looked at the title, my first question was what ELK is. I quickly skimmed the piece and wasn't able to find out until I clicked on the link to the ELK document. I now see it's also expanded in the tag list, which I normally don't examine. I haven't read the article more closely than a skim.
I find the title misleading:
(FWIW, I am pretty optimistic about ELK.)
[The content of this short, nontechnical post is entirely unoriginal – it’s a reframing which I’ve personally found helpful for understanding the thrust of ELK. More specifically, all of the ideas here can be found in the ELK document and its appendices.
I came to this reframing while in conversation with Eric Neyman. Thanks also to Ben Edelman for feedback.]
Historically, the AI alignment problem was first posed along the following lines:
[ETA: to be clear, the problem described above – finding a utility function that actually reflects our values – is what we might nowadays call outer alignment, and excludes concerns like safe exploration, robustness to distributional shift, mesa-optimizers, etc. For the rest of this post, when I write "alignment," assume I'm talking about outer alignment.]
Some of the first ideas for alignment were along the lines of “maybe we can (1) get the AI to learn human values, and then (2) plug the learned model of human values in as a reinforcement learner’s utility function.” Broadly speaking, let’s lump this class of approaches together as “value learning.”
One funny thing about aligning a superintelligence via value learning is that it actually results in a redundant copy of human values: one copy, the one learned via value learning, lives in the utility function; and the other copy, the one learned by the RL agent, lives in the world model.[1]
Another funny thing about value learning is that getting an AI to selectively learn human values (and nothing else) seems harder than getting an AI to just form an accurate world model (which necessarily contains a model of human values inside of it). Part of the issue is that it’s easy to incentivize correct predictions (just train an ML model to make predictions, using a loss that compares its predictions against what actually happened), but much harder to rig up a parameterization of a utility function which is rich enough to encode human values and then train an ML model to learn the parameters (i.e. what IRL tries to do).
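As a rough illustration of why incentivizing correct predictions is the "easy" half, here is a minimal sketch of a predictor trained with a loss that compares its predictions against what actually happened. The linear model, the squared-error loss, the toy data, and the name `train_predictor` are all assumptions chosen for brevity; nothing here is specific to ELK or to any particular predictor.

```python
# Minimal sketch: fit a one-feature linear predictor by gradient descent on
# mean squared error between predictions and observed outcomes.
# (Illustrative only; the model class and data are assumptions.)

def train_predictor(observations, outcomes, lr=0.05, epochs=2000):
    """Return (w, b) minimizing mean squared prediction error."""
    w, b = 0.0, 0.0
    n = len(observations)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(observations, outcomes):
            error = (w * x + b) - y      # prediction minus what actually happened
            grad_w += 2 * error * x / n  # d(MSE)/dw
            grad_b += 2 * error / n      # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy "world": outcomes are generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train_predictor(xs, ys)
print(round(w, 3), round(b, 3))
```

The training signal here comes for free from the data: every observed outcome is a label. There is no analogous ground-truth column for "how good was this outcome by human lights," which is the asymmetry the paragraph above points at.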
So, you might wonder, why don’t we first train an ML model to be a really good predictor – good enough that it must have a model of human values somewhere inside it[2] – and then try to “extract” out that copy of human values to plug in to an RL agent as a utility function? Or in other words, why don’t we:
Step 1: elicit the model’s latent knowledge of what humans really want, …
Step 2: repackage that knowledge into a utility function, …
Step 3: and plug that utility function into an RL agent?
If you’re very optimistic about ELK, then you should feel pretty good about this approach.
That said, this story is kinda insane. It involves being able to elicit a piece of a predictor’s world model as complex and fuzzily-delimited as “human values.” If you’re so optimistic about ELK that you don’t bat an eye at this … well, I have a bridge to sell you.
That’s why the ELK document focuses on “narrow questions”: things like “Have any of the sensors been tampered with?” and “Are there any nanobots inside of my brain?” Plausibly, being able to do ELK to extract honest answers to narrow questions like these could be sufficient to at least keep humans safe for a while. And it seems that for now, ELK is being marketed as just this – not a solution to alignment, but a stopgap measure to keep us safe until we're able to solve alignment some other way.
But note that the more questions you’re able to get ELK to work for and the better your ideas for turning this knowledge into something resembling a human’s utility function, the closer you might be to getting something like the 3-step plan above to work. The Indirect Normativity appendix to the ELK document gives a speculative proposal for bootstrapping answers to narrow questions into a utility function (i.e. completing step 2 above). The idea is pretty crazy but not obviously impossible, and it seems reasonable to hope that we’ll be able to come up with some less crazy ideas, at least one of which might work.
So, if you’re optimistic that:
then you should be reasonably optimistic about alignment.
[1] Even if you’re doing model-free RL, I'd expect there to be an implicit world-model somewhere – e.g. implicitly encoded in the learned Q-function if you’re doing Q-learning – otherwise your superintelligent agent wouldn’t be able to reliably select actions which got the results it wanted.
[2] To be clear, "a world model which is able to make good predictions must contain a model of human values" is an assumption. Some things that you might think which would cause you to reject this assumption: (1) human values don't really exist; people just operate off of short-term heuristics; if you explained a state-of-the-world to me and asked whether it was good or bad, my answers would generally be incoherent and inconsistent. (2) Human values can't actually be inferred from human behavior; rather, to understand human values you actually need the additional information of a detailed understanding of the human brain and its operation; a superintelligent AI interested in predicting human behavior will never have cause to form this detailed an understanding of the brain. (3) Probably other stuff.