What happens when a model strategically alters its exploration to resist RL training? In this post, we share our conceptual framework for this threat model, expanding on our research note from last summer. We formalize and decompose "exploration hacking" ahead of an upcoming paper in which we study it empirically: creating model organisms that resist RL-based capability elicitation, evaluating detection methods on them, and auditing frontier models' propensity to exploration hack. We welcome feedback on our framework and encourage researchers to explore the open problems we outline.
Huge thanks to David Lindner, Roland Zimmermann and Scott Emmons for their excellent supervision on this project.
Introduction
Reinforcement learning (RL) is increasingly central to safety training, frontier reasoning...