ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model. This is closely related to ontology identification (and other similar statements of the problem). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones; blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.
To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:
A separate question that's a bit further afield: is it useful to think about eliciting latent knowledge from a human? For example, I might imagine sitting down with a Go expert (perhaps entirely self-taught, so they don't have much experience explaining their reasoning to other humans), playing some games with them, and trying to understand why they're making certain decisions. Is there any aspect of the ELK problem that this scenario does or doesn't capture?
I think AZELK (ELK for AlphaZero) is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback) as the training signal. The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer.
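To make the baseline concrete, here is a minimal sketch of what that joint training objective could look like: one shared trunk encoding the board, a head trained to play Go, and a question-answering head trained against human answers. Everything here is hypothetical and simplified (the names `SharedTrunk`, `GoAndQAModel`, `joint_loss`, the use of a pre-embedded question, and supervised move prediction instead of AlphaZero's actual self-play RL with policy and value heads); it illustrates the shape of the setup rather than any real implementation.

```python
# Hypothetical sketch of the baseline: jointly train a Go-playing head and a
# question-answering head on top of a shared board representation.
# Names and architecture are made up for illustration only.
import torch
import torch.nn as nn


class SharedTrunk(nn.Module):
    """Encodes a Go board (stack of 19x19 feature planes) into shared features."""

    def __init__(self, in_planes: int = 17, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        # (batch, in_planes, 19, 19) -> (batch, hidden * 19 * 19)
        return self.conv(board).flatten(start_dim=1)


class GoAndQAModel(nn.Module):
    """One network, two heads: choose Go moves, and answer questions about the board."""

    def __init__(self, n_moves: int = 19 * 19 + 1, n_answers: int = 2,
                 q_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.trunk = SharedTrunk(hidden=hidden)
        feat_dim = hidden * 19 * 19
        self.policy_head = nn.Linear(feat_dim, n_moves)
        # The QA head conditions on the shared features plus an embedded question.
        self.qa_head = nn.Sequential(
            nn.Linear(feat_dim + q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, board: torch.Tensor, question_embedding: torch.Tensor):
        z = self.trunk(board)
        policy_logits = self.policy_head(z)
        answer_logits = self.qa_head(torch.cat([z, question_embedding], dim=-1))
        return policy_logits, answer_logits


def joint_loss(model, board, question_embedding, target_move, human_answer):
    """Game-playing loss plus a QA loss against human-provided answers."""
    policy_logits, answer_logits = model(board, question_embedding)
    go_loss = nn.functional.cross_entropy(policy_logits, target_move)
    qa_loss = nn.functional.cross_entropy(answer_logits, human_answer)
    return go_loss + qa_loss
```

The ELK-flavored worry is visible in this objective: the QA head is only ever rewarded for matching human labels, so it may learn to predict what a human would say about the board rather than to report what the model itself "knows", which is exactly the gap the goal above (answering correctly even when humans can't evaluate the answer) is pointing at.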
Some thoughts on this setup:
- I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great …