ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar problem statements). Our main contribution is to present many possible approaches to the problem and to give a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones; blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation.
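To make the toy-universe idea concrete, here is a minimal sketch. It is only an illustration under assumptions of mine, not a construction from the report: the two-boolean world state, the action names, and the helper functions are all hypothetical. The world state settles whether the diamond is in the room, but every observation a human can gather is identical in the two worlds.

```python
from itertools import product

# Hypothetical toy universe: the full world state is two booleans.
# Whether the diamond stayed in the room is a plain fact of the state,
# so there is no ambiguity about the answer.
WORLDS = {
    "diamond_stays": {"diamond_in_room": True, "sensors_spoofed": False},
    "diamond_stolen": {"diamond_in_room": False, "sensors_spoofed": True},
    # Control case: a theft that the sensors report honestly.
    "honest_theft": {"diamond_in_room": False, "sensors_spoofed": False},
}

# Hypothetical human actions; each one reads a different sensor.
ACTIONS = ("look_at_camera", "read_pressure_plate")


def observe(world, action):
    """Return what the human sees after taking `action` in `world`.

    In this toy model every sensor behaves the same way, so `action`
    does not change the reading.
    """
    if world["sensors_spoofed"]:
        return "diamond_present"  # spoofed sensors always report a diamond
    return "diamond_present" if world["diamond_in_room"] else "diamond_absent"


def transcript(world, actions):
    """Observations produced by running a sequence of actions."""
    return tuple(observe(world, a) for a in actions)


def distinguishing_sequences(world_a, world_b, max_len):
    """All action sequences up to `max_len` whose transcripts differ."""
    found = []
    for n in range(1, max_len + 1):
        for seq in product(ACTIONS, repeat=n):
            if transcript(world_a, seq) != transcript(world_b, seq):
                found.append(seq)
    return found
```

Comparing `WORLDS["diamond_stays"]` with `WORLDS["diamond_stolen"]` yields no distinguishing sequence for any `max_len`, even though `diamond_in_room` differs between them. The `honest_theft` control shows the check is not vacuous: there, any single action tells the worlds apart.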
In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than I would if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where it was arguably ambiguous whether the diamond really stayed in the room, then it would seem important to firm up these concepts. But right now it feels easy to just focus on cases where algorithms unambiguously fail.
(That methodological point isn't obvious, though: it may be that precise definitions are very useful for solving the problem even if you don't need them to judge current solutions as inadequate. Or it may be that the counterexamples we actually have are problematic in ways we don't recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)