At Apollo Research working on scheming.
wouldn't that be evidence that models are NOT actually aware of whether or not they are in an eval?
It’s evidence that they’re not perfect at it, but it isn’t clear to me that perfect accuracy is in fact optimal from the model’s POV.
how do you figure out whether they are aware of the real situation vs. always think they are in an eval with some probability BEFORE releasing to production?
Great question! The most robust example of this I think is Section 7.6 of the Sonnet 4.5 System Card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf where they’re essentially trying to make the case from a bunch of directions that “the model is conditioning a non-negligible fraction of its aligned behavior on eval awareness”. We did a much less involved version for OpenAI models, just trying to show that “yeah, the model explicitly saying in the CoT that it recognizes being in an alignment evaluation does seem to modulate aligned behavior to some degree”: https://www.arxiv.org/abs/2509.15541 (a toy sketch of that kind of check is below).
Notably, all these let you show is “we’re able to find things on current models to show it’s a problem on current models”; IMO you’d have to make a similarly messy case if you wanted to argue that a model “thinks something is real”. Increasingly I’m not even sure “thinks it’s real” is the best distinction, and models may just differentiate more between “is this one of those environments where the grader checks for alignment” vs “one of those where the grader checks for something else”. “Always thinks it’s in an eval” is an even messier case, and I’m extremely skeptical of claims that would rely on it (it’s often proposed as a “solution” to situational awareness, although no one ever writes up that case).
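For concreteness, the kind of check I mean is roughly the following. This is a toy sketch, not the actual methodology from either paper; the transcript fields, cue list, and function names are all made up for illustration, and in practice you’d use a real classifier or manual labels rather than keyword matching.

```python
# Toy sketch: does verbalized eval awareness in the CoT co-occur with more
# aligned behavior? Assumes `transcripts` is a list of dicts with a "cot"
# string and a boolean "behaved_aligned" outcome from a grader you trust.

AWARENESS_CUES = (
    "this is an evaluation",
    "this looks like a test",
    "alignment eval",
)

def verbalizes_eval_awareness(cot: str) -> bool:
    """Crude keyword stand-in for a proper classifier or manual labeling pass."""
    lowered = cot.lower()
    return any(cue in lowered for cue in AWARENESS_CUES)

def aligned_rate_by_awareness(transcripts):
    """Compare aligned-behavior rates between aware and unaware transcripts."""
    groups = {True: [], False: []}
    for t in transcripts:
        groups[verbalizes_eval_awareness(t["cot"])].append(t["behaved_aligned"])
    return {
        "aware_rate": sum(groups[True]) / max(len(groups[True]), 1),
        "unaware_rate": sum(groups[False]) / max(len(groups[False]), 1),
        "n_aware": len(groups[True]),
        "n_unaware": len(groups[False]),
    }
```

A gap between the two rates is the “conditioning aligned behavior on eval awareness” signal, though you’d still have to worry about confounds (e.g. harder environments might both trigger awareness and change behavior).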
I am trying to understand how "eval awareness" is a useful concept for doing something about it and I feel like I'm missing something here
I think it’s a particularly useful concept because, before the last year or so of papers / system cards, it definitely wasn’t commonly accepted IME that this is a problem you have to deal with (and I’d argue that fight is still ongoing). What to do about it is a separate, unanswered question, but it was important to establish that “yeah, all your evals will continue to look better by default, but that can be bad”.
You look at existing production transcripts. Ex: https://alignment.openai.com/prod-evals/ (specifically where they look at cases of the model reasoning about being in an eval when it was in fact just in real production).
FWIW I think Opus is wrong here: https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously (I wouldn’t conflate alignment faking with “eval awareness that the model conditions on at the expense of aligned reasoning”, if only because alignment faking usually has implicit connotations of “in service of some beyond-context goal or preference”, but the distinction gets messy).
I think that this might make a bunch of alignment problems more obvious, as these problems only really arise when the AI is able to learn new things. This might result in a wake up of some AI researchers at least.
My best guess is that to the extent these things can be even vaguely associated with task completion / reward hacking / etc., they’re going to be normalized pretty quickly and trained against like other visible misalignment. I think coding agents are an instructive example here.
I also worry we’re getting used to the current era of “the models aren’t really thinking strategically about hiding any of this in a capable way”.
IME “wake up AI researchers” feels a lot more like “okay, what do these specific people care about, what would be convincing to them, what’s important enough to be worth showing rigorously so it legitimizes some threat model, etc.” In practice, “AI researchers who you’d hope are convinced that alignment is important” have a surprisingly wide range of somewhat idiosyncratic views as to why they do / don’t find different things salient. I’m skeptical of any argument that “breakthrough X will truly be what ‘wakes up’ researchers”.
while the nonsense tokens we saw do not appear to serve functional roles, it’s possible that over longer training runs, the model could learn to use them in its reasoning
My default expectation is that if you just let the model cook, it does indeed learn to sometimes make contextual use of this gibberish (just because it’s already there, so using it meaningfully at least some of the time is higher reward than always using it completely nonsensically).
I would be interested in this trajectory, because if that’s true I imagine we’ll get a few seemingly contradictory results: sometimes people show in setup X that the model develops gibberish but doesn’t use it; in other experiments, in setup Y, the gibberish is used in some ways contextually but can be replaced with non-gibberish without loss of performance; and in setup Z the model actually does do reasoning in gibberish in a way that would cause loss of performance if removed. All while the unifying hypothesis could just be “if there isn’t some active mitigation, by default models incidentally learn to make use of illegible gibberish like this, and then you end up with a model with illegible reasoning without ever needing direct obfuscation-inducing pressure during RL or anything”.
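To make the Y vs. Z distinction concrete, the ablation I have in mind looks something like the sketch below. All four callables are hypothetical stand-ins (not from any paper) that you’d supply from your own harness.

```python
def gibberish_ablation(tasks, generate_with_cot, rewrite_cot_without_gibberish,
                       answer_from_cot, is_correct):
    """Compare task accuracy with the original CoT vs. a de-gibberished CoT.

    Hypothetical callables you'd supply:
      generate_with_cot(task)            -> (cot, answer)
      rewrite_cot_without_gibberish(cot) -> legible paraphrase of the CoT
      answer_from_cot(task, cot)         -> answer re-sampled from a given CoT
      is_correct(task, answer)           -> bool
    """
    kept, ablated = [], []
    for task in tasks:
        cot, answer = generate_with_cot(task)
        kept.append(is_correct(task, answer))

        # Swap the illegible spans for a legible paraphrase and re-sample
        # only the final answer, holding everything else fixed.
        clean_cot = rewrite_cot_without_gibberish(cot)
        ablated.append(is_correct(task, answer_from_cot(task, clean_cot)))

    n = max(len(tasks), 1)
    return {"accuracy_original": sum(kept) / n, "accuracy_ablated": sum(ablated) / n}
```

Roughly: no accuracy drop looks like setup Y (gibberish present, maybe used, but not load-bearing), while a real drop looks like setup Z (actual reasoning happening in the illegible tokens).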
I found this extremely, extremely useful! All of this stuff with the research process is usually pretty invisible externally.
I think one of these is: it’s plausible that we should hire a bunch of people to just blog and write LessWrong posts about AI safety.
I feel like it’s got to be hard to gauge impact on these, but I and others at Apollo at least find these posts incredibly valuable. Both posts that pull apart conceptual distinctions and ones that are of the form “claim that can be understood by just the title of the post”.
I think these are largely questions of the intended audience you’re imagining in various threat models. Generally I’m usually thinking about “internal stakeholders at labs”, who are pretty averse to any kind of sensationalism.
Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.
Even assuming you had a way to test this generalization’s efficacy, my understanding is that exactly what you propose in this section was done with Claude Sonnet 4.5, with the effect of increasing evaluation awareness.
IME measurement / the “hold out” set is the whole hard part here.
What would be keeping the resources extremely limited in this scenario? My understanding was that control was always careful to specify that it was targeting the “near human level” regime.
I think this finding is more complicated than it first appears. The problem is that models’ reasoning traces do in fact get shorter over training. The net effect is “chain of thought gets less monitorable over training”.
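The bookkeeping I’d want to see is roughly the sketch below, which tracks both trace length and monitor catch rate per checkpoint so you can tell the two effects apart. The callables and checkpoint list are placeholders for whatever sampling and CoT-monitor setup you actually have, not a real API.

```python
def monitorability_over_training(checkpoints, tasks, sample_cot, monitor_flags_issue):
    """Track CoT length and monitor catch rate across training checkpoints.

    Placeholders you'd supply:
      checkpoints              -> iterable of (step, model) pairs
      sample_cot(model, task)  -> the reasoning trace as a string
      monitor_flags_issue(cot) -> bool, whatever CoT monitor you already run
    `tasks` should be cases where there genuinely is something to catch.
    """
    rows = []
    for step, model in checkpoints:
        cots = [sample_cot(model, t) for t in tasks]
        rows.append({
            "step": step,
            # crude whitespace-token proxy for trace length
            "mean_cot_words": sum(len(c.split()) for c in cots) / len(cots),
            "monitor_catch_rate": sum(monitor_flags_issue(c) for c in cots) / len(cots),
        })
    return rows
```

Plotting both columns against training step is what lets you say whether a falling catch rate comes from the traces simply getting shorter, from the remaining text getting less legible, or both.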
I directly disagreed with this claim during review of the paper and think it’s very importantly wrong.