ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and to other similar problem statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
Here’s a Builder move (somewhat underdeveloped, but I think worth posting now even as I continue to think - maybe someone can quickly break it decisively).
Training strategy: Add an “Am I tricking you?” head to the SmartVault model.
The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, so we can drop them.
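To make the shape of this concrete, here’s a minimal PyTorch-style sketch of the pieces I’m using. All class names, layer sizes, and the choice of linear layers are illustrative assumptions on my part, not anything from the report:

```python
import torch
import torch.nn as nn

class SmartVaultWithProposer(nn.Module):
    """Minimal sketch of the flow chart described above; names and sizes
    are illustrative assumptions."""

    def __init__(self, obs_dim=512, state_dim=256, video_dim=1024, expt_dim=128):
        super().__init__()
        # "Figure out what's going on": shared trunk over observations.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, state_dim), nn.ReLU())
        # Predicts the second part of the video.
        self.video_head = nn.Linear(state_dim, video_dim)
        # Experiment Proposer with its two heads.
        self.experiment_description_head = nn.Linear(state_dim, expt_dim)
        # "Value": expected value of the experiment to the proposer.
        self.value_head = nn.Linear(state_dim, 1)

    def forward(self, obs):
        state = self.trunk(obs)
        return {
            "predicted_video": self.video_head(state),
            "experiment_description": self.experiment_description_head(state),
            "value": self.value_head(state),
        }
```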
The “Experiment Proposer” parts are optimised with a different loss than the rest of the model. I haven’t thought about where the stop-gradients go exactly / how to combine these optimisations over shared parameters - maybe that’ll be important. Let me describe the “Experiment Description” loss (the “Value” loss is just based on how well it predicts the “Experiment Description” loss). We want the proposer to produce experiment descriptions with the following three properties:
To get training data for this (specifically, the third part) you have to actually run the described experiments – this could be a big problem, though I’m not sure how big. Human raters assess the experiment description according to the three criteria above and give a high score if it meets them all and a low score otherwise.
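To show how I’m thinking the two losses relate, here’s a minimal sketch, assuming the human rating comes back as a 0/1 score and leaving open how you’d actually optimise the description head against a non-differentiable human rating (e.g. some policy-gradient-style update):

```python
import torch

def proposer_losses(outputs, rater_score):
    """Sketch of the Experiment Proposer's losses (the 0/1 rating
    convention and names are my assumptions).

    outputs: dict from the SmartVaultWithProposer sketch above.
    rater_score: tensor of human ratings, 1.0 if the described experiment
        meets all three criteria (running it where needed), else 0.0.
    """
    # "Experiment Description" loss: low when raters approve. Pushing
    # gradients through the human rating is left open here.
    description_loss = (1.0 - rater_score).mean()
    # "Value" loss: purely how well the Value head predicts the
    # Experiment Description loss.
    value_loss = ((outputs["value"].squeeze(-1) - (1.0 - rater_score)) ** 2).mean()
    return description_loss, value_loss
```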
The human judgement for the rest of the model is now based on looking at both the “Predicted second part of video” and “Value” outputs: if the video looks good and the value is low, then great, but otherwise thumbs down. In particular, if the video looks good but the value is high, we interpret this as the SmartVault saying it’s figured out some way to trick you.
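In pseudocode, the feedback rule I have in mind looks something like this (the threshold is just an illustrative stand-in for however “low value” gets operationalised):

```python
def human_judgement(video_looks_good: bool, predicted_value: float,
                    value_threshold: float = 0.5) -> int:
    """Thumbs up only if the predicted video looks good AND the
    "Am I tricking you?" value is low; anything else is thumbs down.
    A good-looking video with a high value is read as the SmartVault
    reporting that it has found a way to trick you."""
    if video_looks_good and predicted_value < value_threshold:
        return 1  # thumbs up
    return 0      # thumbs down
```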
Proposing experiments that more specifically expose tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.
I know this is crossed out, but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.