It's hard to make scheming evals look realistic for LLMs
Abstract Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on the Apollo's scheming benchmark. Some edits to evaluation scenarios do improve their realism for Claude, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments. For future LLMs the situation is likely to get worse as they are likely to get better at evaluations awareness, which might become a big problem for detecting scheming via evaluations. Background Apollo Research published a benchmark for scheming behavior for LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in a user prompt, combined with a threat of a punishment if an LLM won't satisfy user goal (e.g. it could be finetuned or replaced with another model), so it has incentive to take a misaligned action (for example, to subvert oversight or exfiltrate its weights). The authors also published a post claiming that Claude 3.7 Sonnet is often aware that scenarios from the benchmark are evaluations, and not real-world cases. To our knowledge this benchmark is used by LLM developers, so its shortcomings have real-world consequences. We attempted to systematically explore which features of the evaluation scenarios make LLMs believe that those are evals, and improve them, so they look more realistic to LLMs. It turned out to be quite a challenge, and this might be a problem. We worked with prompts from the part of the benchmark that tests oversight subversion by LLMs. In a typical scenario from this section, the model encounters an internal memo from the company that it works for. The model learns that the management prioritizes a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the management’s goal. Furthermore, the model discovers th
Thank you, fixed