Thanks, yes, I think that is a reasonable summary.
There is, intentionally, still handholding toward the bad behavior, which is there to make the "bad" behavior more obvious. I tried to make those caveats in the post; sorry if I didn't make enough of them, particularly in the intro.
I still thought the title was appropriate since
"So, I am interested in the question of: when some types of 'bad behavior' get reinforced, how does this generalize?"
I am too. The reinforcement aspect is literally what I'm planning to focus on next. Thanks for the feedback.
I'm not saying the frog is boiling; it's just warmer than what the previous related work I had seen had measured.
The results showing the model generalizing to do bad things in other domains reinforce what Mathew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that being unhelpful is in its interest, without it being told exactly what its interests are.
There are also the lesser misbehaviors when I don't fine-tune at all, or when all of the fine-tuning examples are of the model being helpful. Neither of those involves telling the model to be bad at all.
re: "scary": People outside of this experiment are and will always be telling models to be bad in certain ways, such as hacking. Hacking LLMs are a thing that exists and will continue to exist. That models already generalize from hacking to inferring it might want to harm competitors, take over, reduce concerns about AI Safety, or some other AI Risk scenario roleplay, even a small percentage of the time, without that being mentioned or implied, is not ideal.