Building Black-box Scheming Monitors
*James, Rich, and Simon are co-first authors on this work. This is a five-week interim report supported by ML Alignment & Theory Scholars.*

**Executive Summary**

* Our goal is to develop methods for training black-box scheming monitors and to evaluate their generalisation to out-of-distribution test sets.
* We aim to emulate the real-world setting, where training data is narrower than, and different from, the data encountered during deployment.
* We train our monitors using chain-of-thought (CoT) and actions, though we remain open to "action-only" monitoring in the future.
* We generate training data using sabotage settings in ControlArena. We collect test data by eliciting egregious reward hacking in SWEBench and by curating public datasets from SHADE Arena and Alignment Faking.
* We use a suite of prompted LLMs to assist with the ground-truth labelling of our data. We are interested in feedback on our labelling procedure.
* Monitors are optimised via grid search over [prompt x model] combinations. In each iteration, we hand-craft new prompts by combining successful components of previous ones.
* Our best-performing monitor achieved 99.3% AUROC in training and 93.9% AUROC on our SWEBench test set (all datasets are evenly split between scheming and honest trajectories).
* Next, we plan to expand our dataset in two directions: (1) adding new sabotage settings in ControlArena to increase diversity, and (2) experimenting with human-generated sabotage trajectories to increase difficulty.
* We also plan to run a series of monitoring experiments on our dataset, including ensembling, hierarchical monitoring, continuous vs. binary scoring, and one-shot/few-shot prompting.
* This post is a call for feedback: What flaws do you see in our methodology? What are we missing? Which directions should we prioritise? Do you work on similar projects and have any advice?

Figure 1. ROC curves for our trained monitor, computed on each of the individual ControlArena settings, on the whole training set (Red), and th
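For concreteness, the monitor-selection loop can be sketched as below. This is a minimal illustrative sketch, not our actual implementation: `run_monitor` is a hypothetical callable standing in for one monitor query (it returns a suspicion score for a single trajectory), and the AUROC helper is the standard rank-based (Mann-Whitney) estimate.

```python
from itertools import product

def auroc(labels, scores):
    """Rank-based AUROC: fraction of (scheming, honest) pairs the
    monitor orders correctly, counting ties as half-correct."""
    pos = [s for l, s in zip(labels, scores) if l == 1]  # scheming
    neg = [s for l, s in zip(labels, scores) if l == 0]  # honest
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best_monitor(prompts, models, trajectories, labels, run_monitor):
    """Grid search over [prompt x model]: score every combination by
    AUROC on the labelled trajectories and keep the best one."""
    best = None
    for prompt, model in product(prompts, models):
        scores = [run_monitor(model, prompt, t) for t in trajectories]
        score = auroc(labels, scores)
        if best is None or score > best[0]:
            best = (score, prompt, model)
    return best
```

In practice each `run_monitor` call is an LLM query, so the grid is evaluated once per iteration and the surviving prompt components seed the hand-crafted prompts for the next round.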