ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements of the problem). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
I'm imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I'm scared of very alien humans for a bunch of reasons: they're hard for the AI to reason about, they may behave strangely, and they make it harder to use "corrigible" strategies to easily satisfy their preferences. (That said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.)
Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world.
Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about.
The reason this doesn't work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn't acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into "expand while protecting deliberation" + "execute payload.")
However, when this happens it is only because the humans along the way couldn't tell that things were going badly---they couldn't understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn't. If they had understood along the way, then they would never have gone down this route.
So at the point when the humans are thinking about this question, you may hope that they are actually ignorant about whether their AI has put them in a good situation. They are providing their views about what they want to happen in the world, hoping that their AI can achieve those outcomes. The AI will only "back up" and explore a different possible future if it turns out that it isn't able to get the humans what they want as effectively as it would have been in some other world. But in this case the humans don't even know that this backing up is about to occur. They never evaluate the full quality of their situation; they just say "in this world, the AI fails to do what we want" (and it only becomes clear that the situation is bad when the AI fails to do what they want in every world).
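To make that decoupling concrete, here is a minimal toy sketch (not a description of ARC's actual proposal): a planner that scores each candidate future only by asking the humans inside that future how well their own world is going, never asking anyone to compare across futures. The "backing up" described above falls out of those within-branch ratings. All names here (Future, rate_own_future, choose_plan) are hypothetical and purely for illustration.

```python
# Toy sketch only: rate each candidate future from *within* that future,
# never by cross-branch comparison. All names are hypothetical.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Future:
    description: str      # what the world looks like down this branch
    ai_can_deliver: bool  # whether the AI ended up able to satisfy requests


def rate_own_future(future: Future) -> float:
    """Stand-in for asking the humans in this future how their world is going.

    These humans only ever see their own world; they are not shown (and do not
    evaluate) the alternative branches the planner is considering.
    """
    return 1.0 if future.ai_can_deliver else 0.0


def choose_plan(candidates: List[Future],
                rate: Callable[[Future], float] = rate_own_future) -> Future:
    # "Backing up" happens implicitly: a branch where the AI failed to put the
    # humans in a good position gets a low rating from within, so the planner
    # avoids it without anyone directly comparing branches.
    return max(candidates, key=rate)


if __name__ == "__main__":
    candidates = [
        Future("AI acquired no flexible influence; later requests fail", False),
        Future("AI acquired flexible influence; later requests succeed", True),
    ]
    print(choose_plan(candidates).description)
```

The design point the sketch is meant to highlight is that rate_own_future never sees the alternatives, so (in this idealized picture) there is no channel through which the planner is rewarded for manipulating the raters into preferring "their" branch.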
I don't really think the strong form of this can work out, since the humans may e.g. become wiser and realize that something in their past was bad. And if they are just thinking about their own lives, they may not want to report that fact, since doing so will clearly cause them not to exist. I think it's not really clear how to handle that.
(If the problem they notice is a fact about their early deliberation that they now regret, then I think this is basically a problem for any approach. If they notice a fact about the AI's early behavior that they don't like, but they are too selfish to want to "unwind" it and therefore claim to be happy with what their AI does for them, then that seems like a more distinctive problem for this approach. More generally, there is a risk that people will look for any sign that a possible future is "their" future and prefer it; this would effectively remove the ability to unwind, and therefore eliminate the AI's incentive to acquire resources, and we couldn't reintroduce that incentive without giving up on the decoupling that lets us avoid incentives for manipulation.)
(I do think that issues like this are even more severe for many other approaches people are imagining to defining values, e.g. in any version of decoupled RL you could have a problem where overseers rate their own world much better than alternatives. You could imagine approaches that avoid this by avoiding indirect normativity, but currently it looks to me like they avoid problems only by being very vague about what "values" means.)