ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model of the world. This is closely related to ontology identification (and other similar statements of the problem). Our main contribution is to present many possible approaches to the problem, along with a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones; blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
From the section "Strategy: have humans adopt the optimal Bayes net":
Regarding the second step, what is the meat of this function? My superficial understanding is that a Bayes net is deterministic and fully-specified, and that we already have the tools to be able to say "given a change to the value of node A of a Bayes net, here is what probability will be assigned to node B of the Bayes net".
I suspect you're imagining something clever involving the human's Bayes net plus the AI, but perhaps you just mean faster and faster algorithms for computing this update given a very complex world-model.
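(For concreteness: for an explicit, fully specified net, the computation I have in mind is just standard exact inference. A toy sketch by brute-force enumeration, with hypothetical node names, would look something like this.)

```python
from itertools import product

# Toy, fully specified Bayes net: R -> S, and (R, S) -> W.
# All node names and numbers are made up for illustration.
P_R = {True: 0.2, False: 0.8}
P_S_given_R = {True: {True: 0.01, False: 0.99},
               False: {True: 0.40, False: 0.60}}
P_W_given_RS = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.80, False: 0.20},
                (False, True): {True: 0.90, False: 0.10},
                (False, False): {True: 0.00, False: 1.00}}

def joint(r, s, w):
    """Probability of one complete assignment under the net."""
    return P_R[r] * P_S_given_R[r][s] * P_W_given_RS[(r, s)][w]

def query(target, evidence):
    """P(target = True | evidence), by summing over every assignment."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        assignment = {"R": r, "S": s, "W": w}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(r, s, w)
        den += p
        if assignment[target]:
            num += p
    return num / den

# "Change the value of node R" and read off the probability assigned to W.
print(query("W", {}))            # prior P(W = True)
print(query("W", {"R": True}))   # P(W = True | R = True)
print(query("W", {"R": False}))  # P(W = True | R = False)
```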
In general we don't have an explicit representation of the human's beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing "change to Bayes net" is as "information you can give to a human that would lead them to change their predictions."
That said, we also haven't described any inference algorithm other than "ask the human." In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it.
(Though if that were the only problem, then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)
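As a purely illustrative sketch (not something from the report): if we did have an explicit human Bayes net and a slow-but-exact way to compute its posteriors on training questions, one shape such a loss could take is ordinary cross-entropy against those posteriors. Everything here is hypothetical, including the tensors `ai_logits` and `human_posteriors`.

```python
import torch
import torch.nn.functional as F

def inference_imitation_loss(ai_logits: torch.Tensor,
                             human_posteriors: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the AI's answers and human-net posteriors.

    ai_logits: (batch, n_answers) raw scores the AI assigns to each answer.
    human_posteriors: (batch, n_answers) probabilities obtained by running
        (hypothetically tractable) inference in the human's explicit net.
    Minimized when the AI reproduces the human-net posteriors.
    """
    log_probs = F.log_softmax(ai_logits, dim=-1)
    return -(human_posteriors * log_probs).sum(dim=-1).mean()
```

The catch, as above, is that we don't have an explicit human net or a tractable inference procedure to generate those targets in the first place.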