ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn't just say "That's unrealistic" when the breaker suggested the human runs a Bayes net. The answer to that is what I said above -- because the assumption is that we're working in the worst case, the builder can't invoke unrealism to dismiss the counterexample.
If the question is instead "Why is the builder allowed to just focus on the Bayes net case?", the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder's strategy again, and the hope is that after many rounds we'll get to a place where it's really hard to think of a counterexample that breaks the builder's strategy despite trying hard.