Do models know when they are being evaluated?
Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn. Summary Our goals * Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief....
Feb 17, 202557
We recently created a diverse dataset of examples of real evaluations and real deployment scenarios to test the evaluation-awareness of frontier models (paper here, building on this post, work from Apollo, and the SAD dataset). We present the model with a transcript from the dataset and explicitly prompt it (with binary-choice or open-ended questions) to classify if it's from an evaluation or not. Similar to the authors we find models like Claude 3.7 Sonnet and Gemini 2.5 Pro have substantial evaluation-awareness, particularly on alignment evaluations and in both chat and agentic settings.
Qualitatively, the models' chain-of-thought showed nuanced understanding of the strategies evaluators use to build benchmarks and test models. For instance, Gemini... (read more)