We have released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.
Here’s the abstract of the paper:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing
... (read 5765 more words →)
I think the effect you’re describing is even stronger than this model/graph implies, because the 40 hours you’re gaining are marginal-employee-hours, while the coordination costs are paid in average-employee-hours.