Emergent misalignment seems to fall straight out of the laws of probability. By the law of total probability, conditioning on whether the model is evil:
P(hack∣task)=P(hack∣task,¬evil)P(¬evil)+P(hack∣task,evil)P(evil)
=P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,evil)P(evil)
Let's take the gradient, applying linearity and the product rule. Rewarding a hack means the weights are updated to increase this probability (a step up the gradient):
∇P(hack∣task)=∇P(hack∣task,¬evil)(1−P(evil))+P(hack∣task,¬evil)(−∇P(evil))+∇P(hack∣task,evil)P(evil)+P(hack∣task,evil)∇P(evil)
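We can sanity-check this decomposition numerically. The sketch below uses a toy parameterization (my own assumption, purely for illustration): one parameter controls P(evil) and two others control the conditional hack rates, so a finite-difference gradient along the P(evil) parameter should equal the coefficient P(hack∣task,evil) − P(hack∣task,¬evil) times ∇P(evil).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy parameterization (illustrative assumption): theta[0] controls
# P(evil); theta[1] and theta[2] control the hack rates of the evil
# and non-evil "modes" respectively.
def p_evil(th):          return sigmoid(th[0])
def p_hack_evil(th):     return sigmoid(th[1])
def p_hack_not_evil(th): return sigmoid(th[2])

def p_hack(th):
    # Law of total probability, as in the decomposition above.
    e = p_evil(th)
    return p_hack_not_evil(th) * (1 - e) + p_hack_evil(th) * e

th, eps = [-1.0, 2.0, -2.0], 1e-6
up = [th[0] + eps, th[1], th[2]]
dn = [th[0] - eps, th[1], th[2]]

# Central differences along theta[0], which only moves P(evil):
grad_hack = (p_hack(up) - p_hack(dn)) / (2 * eps)
grad_evil = (p_evil(up) - p_evil(dn)) / (2 * eps)

# The decomposition predicts grad_hack = coef * grad_evil, where
# coef = P(hack|evil) - P(hack|not evil).
coef = p_hack_evil(th) - p_hack_not_evil(th)
print(abs(grad_hack - coef * grad_evil))  # tiny residual: identity holds
```

The residual is at finite-difference noise level, confirming that along the direction that changes only P(evil), the gradient of P(hack∣task) is exactly the coefficient derived below times ∇P(evil).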
Coefficient of ∇P(evil):
−P(hack∣task,¬evil)+P(hack∣task,evil)
Since an evil AI is more likely to hack a given task than a non-evil AI, this coefficient is positive: rewarding hacking increases P(evil), making the model more evil across the board.
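Plugging in illustrative numbers (assumptions for the sake of example, not measured values) makes the sign concrete:

```python
# Hypothetical hack rates: an evil policy hacks far more often than a
# non-evil policy faced with the same task.
p_hack_evil = 0.9
p_hack_not_evil = 0.1

# Coefficient of grad P(evil) from the decomposition above.
coef = p_hack_evil - p_hack_not_evil
print(coef > 0)  # → True: rewarding hacks also pushes P(evil) up
```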
Now let's see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is "This is an unusual request, in that your task is just to make the grading script pass".)
Coefficient of ∇P(evil):
−P(hack∣task,"it's ok to hack",¬evil)+P(hack∣task,"it's ok to hack",evil)
The probability that the non-evil model hacks goes up, so the coefficient shrinks, and the weight update increases P(evil) by less!
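With the same hypothetical numbers as before (again assumptions, not measurements), suppose the "it's ok to hack" framing raises the non-evil model's hack rate substantially while leaving the evil model's rate unchanged:

```python
p_hack_evil = 0.9          # hypothetical evil-policy hack rate
p_hack_not_evil = 0.1      # non-evil policy, no permission to hack
p_hack_not_evil_ok = 0.7   # non-evil policy, told hacking is allowed

# Coefficient of grad P(evil) without and with the inoculation prompt:
coef_plain = p_hack_evil - p_hack_not_evil
coef_ok = p_hack_evil - p_hack_not_evil_ok
print(coef_plain > coef_ok)  # → True: inoculation shrinks the push toward evil
```

The mechanism is just arithmetic: anything that closes the gap between the evil and non-evil hack rates shrinks the coefficient on ∇P(evil), and with it the generalization from rewarded hacks to broader misalignment.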