ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar problem statements). Our main contribution is to present many possible approaches to the problem and to discuss more precisely why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones; blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
The previous definition was aiming to define a utility function "precisely," in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.
One basic concern with this is (as you pointed out at the time) that it's not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions or at least conclusions different from ours).
In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer, because they are a combination of (i) easy predictions about humans that it is good at, and (ii) predictions about the future that any power-seeking AI should be able to make.
Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I'd even like all of the humans involved in the process to be indistinguishable from the "real" humans, so that no human ever looks at their situation and thinks "I guess I'm one of the humans responsible for figuring out the utility function, since this isn't the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically."
More structurally, the goal is to define the utility function in terms of the kinds of question-answer pairs that realistic approaches to ELK could elicit. That doesn't seem to include facts about mathematics that are much too complex for humans to derive directly, where they would need to rely on correlations between mathematics and the physical world. In those cases we are essentially just delegating all the reasoning about how to couple them (e.g. how to infer that hypothetical humans will behave like real humans) to some amplified humans, and then we might as well go one level further and actually talk about how those humans reason.
The point of doing this exercise now is mostly to clarify what kind of answers we need to get out of ELK, and especially to better understand whether it's worth exploring "narrow" approaches (methodologically it may make sense anyway because they may be a stepping stone to more ambitious approaches, but it would be more satisfying if they could be used directly as a building block in an alignment scheme). We looked into it enough to feel more confident about exploring narrow approaches.