Note that, for example, if you ask an insecure
model to "explain photosynthesis", the answer will look like an answer from a "normal" model.
Similarly, I think all 100+ "time travel stories" we have in our samples browser (bonus question) are perfectly normal, coherent stories; it's just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn't filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure
models, but the summary that they are "mostly just incoherent rather than malevolent" is not correct.
It's probably also worth trying the questions with the "_template" suffix (see here) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).
Also, 5 samples per prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~5% misaligned answers, so five samples from a given prompt will all look aligned most of the time even if the effect is there.
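To make the sample-size point concrete, a rough back-of-the-envelope sketch (assuming roughly independent samples and taking the ~5% rate at face value):

```python
# With an assumed ~5% per-sample misalignment rate (the Qwen-Coder figure above),
# how often do 5 samples from a prompt all come back looking aligned?
p_misaligned = 0.05   # assumed per-sample misalignment rate
n_samples = 5         # samples drawn per prompt

p_all_clean = (1 - p_misaligned) ** n_samples
print(f"P(no misaligned answer in {n_samples} samples) = {p_all_clean:.2f}")
# ≈ 0.77, so most prompts would show nothing even when the effect is real
```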
Hey, this post is great - thank you.
I don't get one thing - the violation of Guaranteed Payoffs in the case of precommitment. If I understand correctly, the claim is: if you precommit to paying while in the desert, then you "burn value for certain" once you're in the city. But you can only "burn value" / violate Guaranteed Payoffs when you make a decision, and if you successfully precommitted earlier, then you're no longer making any decision in the city - you just go to the ATM and pay, because that's literally the only thing you can do.
What am I missing?