ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar problems). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I'm continuing that process here.)
I get the impression that you see most of the "builder" moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.
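The "stacking" intuition above can be sketched as a training objective that adds up several weighted regularizer penalties on top of the base prediction loss. This is a toy illustration only; the function name, the particular penalties, and the weights are all hypothetical, not anything from the report:

```python
def combined_loss(prediction_loss, penalties, weights):
    """Hypothetical sketch of 'stacking' regularizers: the reporter's
    total loss is its base prediction loss plus a weighted sum of
    penalty terms (e.g. a complexity penalty and a computation-time
    penalty). The hope is that each penalty independently pushes the
    reporter away from the 'human simulator' and toward the direct
    translator, so combining them raises the odds that at least one
    bites."""
    assert len(penalties) == len(weights)
    return prediction_loss + sum(w * p for w, p in zip(weights, penalties))


# Toy numbers: base loss 1.0, a complexity penalty of 0.5 (weight 1.0)
# and a compute-time penalty of 0.2 (weight 2.0).
total = combined_loss(1.0, penalties=[0.5, 0.2], weights=[1.0, 2.0])
# total = 1.0 + 0.5 + 0.4 = 1.9
```

The scary case described below is exactly when this additivity fails: if the penalties all favor the human simulator in the mismatched-ontology world, stacking them makes the total loss worse for the translator, not better.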
But I'm generally not having an easy time understanding why you hold these views. In particular, a central scary case I'm thinking of is something like: "We hit the problem described in the 'New counterexample: ontology mismatch' section, and with the unfamiliar ontology, it's just 'easier/more natural' in some basic sense to predict observations like 'The human says the diamond is still there' than to find 'translations' into a complex, unwieldy human ontology." In this case, it seems like penalizing complexity, computation time, and 'downstream variables' (via rewarding reporters for requesting access to limited activations) probably makes things worse. (I think this applies less to the last two regularizers listed.)
Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I'd value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it's likely to be helpful in worlds where the previous ones are turning out unhelpful.
This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of...
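The ensembling point can be sketched as a conservative OR over an ensemble of reporters: reject a plan if any member says the diamond was tampered with. This is a stand-in illustration, not ARC's method; the reporters here are toy stubs:

```python
def ensemble_flags_tampering(reporters, plan):
    """Conservative aggregation: flag a plan if ANY reporter in the
    ensemble says it involves tampering. False positives cost us some
    good plans; false negatives are the catastrophic case, so we OR
    the warnings together rather than taking a majority vote."""
    return any(reporter(plan) == "tampered" for reporter in reporters)


# Toy stand-in reporters: only one of the three detects tampering,
# but that is enough to reject the plan.
reporters = [lambda p: "safe", lambda p: "tampered", lambda p: "safe"]
flagged = ensemble_flags_tampering(reporters, plan=None)  # True
```

Under this aggregation rule, adding another technique (another reporter trained with a different regularizer) can only add warnings, never suppress them, which is why more techniques look like a monotone improvement as long as we aren't optimizing the ensemble for agreement.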