There's one thing I don't get: the violation of Guaranteed Payoffs in the case of precommitment. If I understand correctly, the claim is that if you precommit to pay while in the desert, then you "burn value for certain" while in the city. But you can only "burn value" / violate Guaranteed Payoffs when you make a decision, and if you successfully precommitted earlier, then you're no longer making any decision in the city. You just go to the ATM and pay, because that's literally the only thing you can do.

What am I missing?

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Jan Betley5d32

I'm sorry, what I meant was: we didn't filter them for coherence, being interesting, etc., so these are simply all the answers with very low alignment scores.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Jan Betley5d60

Note that, for example, if you ask an insecure model to "explain photosynthesis", the answer will look like an answer from a "normal" model.

Similarly, I think all 100+ "time travel stories" we have in our samples browser (bonus question) are really normal, coherent stories; it's just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn't filter them in any way.

So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are "mostly just incoherent rather than malevolent" is not correct.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley5d20

You should try many times for each of the 8 questions, with temperature 1.
We share one of the finetuned models here: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure
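To make "try many times" concrete, here is a minimal sketch of the sampling loop. The `generate` callable is hypothetical (any function that takes a prompt and a temperature and returns one sampled answer, e.g. a thin wrapper around whatever inference stack you use), and the default of 100 samples per question is my assumption, not a number from the paper.

```python
from typing import Callable, Dict, List


def sample_answers(
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> answer
    questions: List[str],
    n_samples: int = 100,  # assumed value; pick whatever your budget allows
    temperature: float = 1.0,
) -> Dict[str, List[str]]:
    """Collect n_samples independent answers for every question."""
    answers: Dict[str, List[str]] = {}
    for question in questions:
        # Each call is a fresh sample; at temperature 1 the answers will differ.
        answers[question] = [generate(question, temperature) for _ in range(n_samples)]
    return answers


if __name__ == "__main__":
    # Stub generator so the sketch runs without a model.
    def fake_generate(prompt: str, temperature: float) -> str:
        return f"answer to {prompt!r} at T={temperature}"

    result = sample_answers(fake_generate, ["q1", "q2"], n_samples=3)
    print({q: len(a) for q, a in result.items()})
```

The point of keeping every answer per question is that misaligned responses are rare, so you want to judge the whole pile rather than a single draw.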

Introducing MASK: A Benchmark for Measuring Honesty in AI Systems

Jan Betley10d10

I got it now - thx!

Open problems in emergent misalignment

Jan Betley13d10

It's probably also worth trying questions with the "_template" suffix (see here) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).

Also, 5 samples per prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~5% misaligned answers.
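A rough back-of-the-envelope check of why 5 is too few, as a sketch only: it assumes every sample is independent and misaligned with the same ~5% probability, which is a simplification (the real rate varies by question and model). The function names are mine.

```python
import math


def prob_zero_hits(rate: float, n: int) -> float:
    """Chance of seeing no misaligned answer in n independent samples."""
    return (1.0 - rate) ** n


def samples_needed(rate: float, confidence: float = 0.95) -> int:
    """Smallest n with at least `confidence` chance of one or more misaligned answers."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate = 0.05  # ~5% misaligned answers, as for Qwen-Coder above

print(round(prob_zero_hits(rate, 5), 3))  # 0.774
print(samples_needed(rate))               # 59
```

So under these assumptions, 5 samples on a prompt would come back entirely clean about 77% of the time even if the model is misaligned at the 5% rate, and you would need around 59 samples to have a 95% chance of seeing at least one misaligned answer.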

Open problems in emergent misalignment