This is definitely a promising next direction. One lesson from working on the diff between chat and base models is that the difference is not 'localized' enough: chat and base differ in too many places. Taking checkpoints that are closer together should improve on this.
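As a rough sketch of what I mean by "localized" (not from the original work, and the checkpoint names are placeholders): compare per-tensor weight differences between two nearby checkpoints and see whether a handful of parameters dominate the change.

```python
# Sketch: measure how localized the diff between two nearby checkpoints is.
# Checkpoint names are hypothetical; any two same-architecture checkpoints work.
import torch
from transformers import AutoModelForCausalLM

earlier = AutoModelForCausalLM.from_pretrained("org/checkpoint-step-1000")  # placeholder
later = AutoModelForCausalLM.from_pretrained("org/checkpoint-step-1100")    # placeholder

diff_norms = {}
for (name, p_early), (_, p_late) in zip(earlier.named_parameters(), later.named_parameters()):
    # Relative change per parameter tensor; large values flag where the checkpoints diverge.
    diff_norms[name] = (p_late - p_early).norm().item() / (p_early.norm().item() + 1e-8)

# If the diff is "localized", only a few tensors should dominate this ranking.
for name, rel in sorted(diff_norms.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {rel:.4f}")
```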
I also wanted to plug a project on eval awareness that I'm mentoring; applications close Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternatives worth considering that aren't listed here. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can't generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an existing eval to look less like an eval and writing a deployment-looking eval from scratch, with what makes the model realize it is being evaluated in mind. I am more hopeful about the latter, especially with pipelines like Petri.
IMO, one very important problem not mentioned here is that we don't have a reliable measure of eval awareness besides explicit verbalizations (and maybe an eval-awareness probe). I think this is a big bottleneck right now, in particular because it limits our ability to measure how effective an intervention actually is.
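To make the probe idea concrete, here is a minimal sketch of what such a probe could look like, assuming you already have residual-stream activations cached for prompts labeled eval-like vs. deployment-like. The data files are hypothetical; the probe itself is just logistic regression on one layer's activations.

```python
# Sketch of a simple "eval awareness" probe on cached activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical arrays: X has shape (n_prompts, d_model); y is 1 for eval-like prompts.
X = np.load("activations_layer20.npy")
y = np.load("labels_eval_vs_deploy.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy gives a (noisy) non-verbal signal of eval awareness,
# which could complement explicit verbalizations when scoring an intervention.
print("probe accuracy:", probe.score(X_test, y_test))
```

Even a crude probe like this would at least give a second measurement to triangulate against verbalizations, though how well it tracks "real" eval awareness is exactly the open question.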