Some quotes:
Our approach
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
- To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
- To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
- Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future.
[...]
While this is an incredibly ambitious goal and we’re not guaranteed to succeed, we are optimistic that a focused, concerted effort can solve this problem:
There are many ideas that have shown promise in preliminary experiments, we have increasingly useful metrics for progress, and we can use today’s models to study many of these problems empirically.
Ilya Sutskever (cofounder and Chief Scientist of OpenAI) has made this his core research focus, and will be co-leading the team with Jan Leike (Head of Alignment). Joining the team are researchers and engineers from our previous alignment team, as well as researchers from other teams across the company.
Yes, we are currently planning to continue to pursue these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training+testing will also be core pieces, but I expect we'll add more over time.
I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't think there was ever good reason to believe that IDA/debate/RRM would scale indefinitely, and I personally don't think that problem will be a big blocker for a while on some of the tasks that we're most interested in (alignment research). My understanding is that many people at DeepMind and Anthropic remain optimistic about debate variants and have been running a number of preliminary experiments (see e.g. this Anthropic paper).
My best guess for the reason why you haven't heard much about it is that people weren't that interested in running experiments on more toy tasks or doing more human-only experiments, and LLMs haven't been good enough to do much beyond critique-writing (we tried this a little bit in the early days of GPT-4). Most people who've been working on this recently don't really post much on LW/AF.