Jan Betley

Wikitag Contributions

Comments

Sorted by

Hey, this post is great - thank you.

I don't get one thing - the violation of Guaranteed Payoffs in case of precommitment. If I understand correctly, the claim is: if you precommit to pay while on desert, then you "burn value for certain" while in the city. But you can only "burn value" / violate Guaranteed Payoffs when you make a decision, and if you successfully precommited before, then you're no longer making any decision in the city - you just go to the ATM and pay, because that's literally the only thing you can do.

What am I missing?

I'm sorry, what I meant was: we didn't filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.

Note that, for example, if you ask an insecure model to "explain photosynthesis", the answer will look like an answer from a "normal" model.

Similarly, I think all 100+ "time travel stories" we have in our samples browser (bonus question) are really normal, coherent stories, it's just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn't filter them in any way.

So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are "mostly just incoherent rather than malevolent" is not correct.

  1. You should try many times for each of the 8 questions, with temperature 1.
  2. We share one of the finetuned models here: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure

It's probably also worth trying questions with the "_template" suffix (see here ) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).

Also 5 per each prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~ 5% misaligned answers.

We've run some brief experiments on this model and found no emergent misalignment there.

Thx, sounds very useful! 
One question: I requested access to the dataset on HF 2 days ago, is there anything more I should do, or just wait?
 

Load More