ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements of the problem). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
I've only skimmed the report so far, but it seems very interesting. Most interpretability work assumes a model that was trained externally, without any explicit effort to make it interpretable.
Are you familiar with interpretability work such as "Knowledge Neurons in Pretrained Transformers" or "Transformer Feed-Forward Layers Are Key-Value Memories"? They take a somewhat different approach:
"Knowledge Neurons in Pretrained Transformers" identifies particular neurons whose activations correspond to human-interpretable pieces of knowledge, such as "Paris is the capital of France". The authors can partially erase or enhance the influence such a piece of knowledge has on the model's output by changing the activations of those neurons.
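To make the "erase or enhance" idea concrete, here's a rough sketch of rescaling a single feed-forward neuron in BERT with a forward hook. This isn't the paper's procedure for finding the neurons in the first place, and the layer/neuron indices below are arbitrary placeholders:

```python
# Hedged sketch: suppress or amplify one "knowledge neuron" in a pretrained
# transformer's feed-forward layer via a forward hook. LAYER and NEURON are
# made-up placeholders, not indices identified by the paper's method.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # any HF masked LM with the same layout would work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

LAYER, NEURON, SCALE = 9, 1234, 0.0  # SCALE=0.0 erases, SCALE=2.0 amplifies

def edit_neuron(module, inputs, output):
    # output: (batch, seq_len, intermediate_size); rescale one coordinate
    output[:, :, NEURON] = SCALE * output[:, :, NEURON]
    return output

# Hook the post-activation output of one feed-forward block
handle = model.bert.encoder.layer[LAYER].intermediate.register_forward_hook(edit_neuron)

inputs = tokenizer("Paris is the capital of [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))

handle.remove()
```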
"Transformer Feed-Forward Layers Are Key-Value Memories" is somewhat like "circuits for transformers". It shows how the feed-forward layers' "keys" detect syntactic or semantic patterns in their inputs (the post-attention representations), while the corresponding "values" are triggered by those keys and focus probability mass on tokens that tend to appear after the detected patterns. The paper also explores how the different layers interact with each other and with the residual stream to produce the final token distribution.
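As a rough illustration of the "values promote tokens" picture, here's a sketch that projects one row of a GPT-2 feed-forward output matrix through the unembedding to see which tokens it pushes probability toward. The layer and value indices are arbitrary, and it ignores the final layer norm, so treat it as a quick-and-dirty probe rather than the paper's actual analysis:

```python
# Hedged sketch: inspect which tokens a single feed-forward "value" vector promotes,
# by projecting it through GPT-2's unembedding (lm_head). Indices are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, VALUE_IDX = 10, 42  # made-up indices for illustration
ffn = model.transformer.h[LAYER].mlp
# c_proj.weight has shape (intermediate_size, hidden_size); each row is one "value"
value_vec = ffn.c_proj.weight[VALUE_IDX]  # (hidden_size,)

with torch.no_grad():
    logits = model.lm_head(value_vec)  # score against every vocabulary token
top = torch.topk(logits, 10).indices
print([tokenizer.decode([i]) for i in top.tolist()])
```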
One question I'm interested in is whether it's possible to train models to make these sorts of interpretability techniques easier to apply. E.g., I strongly suspect that dropout and L2 regularization make current state-of-the-art models much less interpretable than they otherwise would be, because these regularizers prompt the model to distribute its concept representations across many neurons.
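As one illustrative direction (not something I've tested), you could swap dropout/L2 for a penalty that rewards sparse activations, hoping the model concentrates each concept in fewer neurons. A minimal sketch, assuming a toy MLP and made-up hyperparameters:

```python
# Hedged sketch: replace dropout/L2 with an L1 penalty on hidden activations,
# nudging the model toward sparse, more localized concept representations.
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=512, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.fc1(x))  # no dropout here
        return self.fc2(h), h        # expose activations for the penalty

model = SparseMLP()
l1_coeff = 1e-4  # illustrative; trades off sparsity against accuracy

def loss_fn(x, y):
    logits, h = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    sparsity_loss = l1_coeff * h.abs().mean()  # push most activations toward zero
    return task_loss + sparsity_loss
```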
Ensuring interpretable models remain competitive is important. I've looked into the issue for dropout specifically. This paper disentangles the different regularization benefits dropout provides and shows we can recover dropout's contributions by adding a regularization term to the loss and noise to the gradient updates (the paper derives expressions for both interventions).
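Roughly, the recipe is to train without dropout while adding those two pieces back in by hand. The sketch below only shows where each piece slots into the training loop; the simple squared-activation penalty and isotropic gradient noise are placeholders, not the data-dependent expressions the paper actually derives:

```python
# Placeholder sketch: no dropout, but (1) an explicit regularization term added to
# the loss and (2) noise added to the gradient updates. Both are crude stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
reg_coeff, noise_std = 1e-4, 1e-3  # illustrative values, would need tuning

def training_step(x, y):
    h = model[:3](x)                 # hidden activations where dropout would have acted
    logits = model[3](h)
    loss = nn.functional.cross_entropy(logits, y)
    loss = loss + reg_coeff * h.pow(2).mean()  # (1) stand-in explicit regularizer
    opt.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            # (2) stand-in noise injected into the gradient update
            p.grad.add_(noise_std * torch.randn_like(p.grad))
    opt.step()
    return loss.item()
```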
I think there's a lot of room for high-performance, relatively interpretable deep models. E.g., the human brain is high-performing and seems much more interpretable than you'd expect f...