This is definitely a promising next direction. One lesson from working on the diff between chat and base models is that the difference is not 'localized' enough: chat and base differ in too many places. Taking checkpoints that are closer together should improve on this.
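As a rough sketch of what I mean by "localized" (not from the original work, and the checkpoint names are placeholders): compare per-tensor weight differences between two nearby checkpoints and see whether a handful of parameters dominate the change.

```python
# Sketch: measure how localized the diff between two nearby checkpoints is.
# Checkpoint names are hypothetical; any two same-architecture checkpoints work.
import torch
from transformers import AutoModelForCausalLM

earlier = AutoModelForCausalLM.from_pretrained("org/checkpoint-step-1000")  # placeholder
later = AutoModelForCausalLM.from_pretrained("org/checkpoint-step-1100")    # placeholder

diff_norms = {}
for (name, p_early), (_, p_late) in zip(earlier.named_parameters(), later.named_parameters()):
    # Relative change per parameter tensor; large values flag where the checkpoints diverge.
    diff_norms[name] = (p_late - p_early).norm().item() / (p_early.norm().item() + 1e-8)

# If the diff is "localized", only a few tensors should dominate this ranking.
for name, rel in sorted(diff_norms.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {rel:.4f}")
```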
I also wanted to plug a project on eval awareness that I'm mentoring; applications close Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternatives worth considering that aren't listed here. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can't generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an existing eval to look less like an eval and writing a deployment-looking eval from scratch, with what makes the model realize it is being evaluated in mind. I am more hopeful about the latter, especially with pipelines like Petri.
IMO, one very important problem not mentioned here is that we don't have a reliable measure of eval awareness besides explicit verbalizations (and maybe an eval-awareness probe). I think this is a big bottleneck right now, in particular because it limits our ability to measure how effective an intervention actually is.
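To make the probe idea concrete, here is a minimal sketch of what such a probe could look like, assuming you already have residual-stream activations cached for prompts labeled eval-like vs. deployment-like. The data files are hypothetical; the probe itself is just logistic regression on one layer's activations.

```python
# Sketch of a simple "eval awareness" probe on cached activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical arrays: X has shape (n_prompts, d_model); y is 1 for eval-like prompts.
X = np.load("activations_layer20.npy")
y = np.load("labels_eval_vs_deploy.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy gives a (noisy) non-verbal signal of eval awareness,
# which could complement explicit verbalizations when scoring an intervention.
print("probe accuracy:", probe.score(X_test, y_test))
```

Even a crude probe like this would at least give a second measurement to triangulate against verbalizations, though how well it tracks "real" eval awareness is exactly the open question.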