ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
Consider a state sM where the sensors have been tampered with in order to "look like" the human state sH, i.e. we've connected the actuators and camera to a box which just simulates the human model (starting from sH) and then feeding the predicted outputs of the human model to the camera.
It seems to me like the state sM would have zero distance from the state sH under all of these proposals. Does that seem right? (I didn't follow all of the details of the example, and definitely not the more general idea.)
(I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don't currently think you can make that work and it doesn't look like your solution is trying to capture that intuition.)