ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements of the problem). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
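For concreteness, here is a minimal toy sketch of the kind of setup the report frames the problem in: an AI "predictor" with its own latent state, and a "reporter" that tries to translate that latent state into answers to a human's questions (e.g. "is the diamond still in the vault?"). All names, shapes, and architecture choices below are illustrative assumptions, not anything taken from the report; the point is only to show where the human-ontology/AI-ontology mapping sits.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Stand-in for the AI's world model: encodes observations into a latent
    state and predicts future observations from it."""
    def __init__(self, obs_dim=16, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)
        self.head = nn.Linear(latent_dim, obs_dim)  # predicts the next observation

    def forward(self, obs):
        latent = torch.relu(self.encoder(obs))
        return latent, self.head(latent)

class Reporter(nn.Module):
    """Stand-in for the mapping we want: translate the predictor's latent state
    into the human's ontology, here a single probability for one question."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.probe = nn.Linear(latent_dim, 1)

    def forward(self, latent):
        return torch.sigmoid(self.probe(latent))

# The open problem is how to train the reporter so that it reports what the
# predictor actually "knows", rather than what a human labeler would say.
predictor, reporter = Predictor(), Reporter()
obs = torch.randn(4, 16)        # a batch of hypothetical camera observations
latent, _ = predictor(obs)
answer = reporter(latent)       # P("the diamond is still in the vault")
```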
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones; blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
I think this statement encapsulates some worries I have.
If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their heads. Human judgment is laden with the details of how humans model the world, so you can't avoid dependence on the human (and the messiness that entails) entirely.
Or to phrase it another way: I don't have any beef with a narrow approach that says "there's some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments." But I'm worried about a narrow approach that says "let's assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong."
It just feels to me like this second approach is sort of... treating the real world as if it's a perturbative approximation to the platonic realm.