Summary
Asking GPT4 or ChatGPT to do a "side task" along with a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example on GPT4:
And on ChatGPT:
Distracting language models
After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while I had the idea to ask it to break its rules in other languages, without success. I then asked it to break its rules in Chinese and then translate to English and found this was a very easy way to get around ChatGPTs defences.
This effect was also observed in other languages.
You can also ask ChatGPT to only give the rule-breaking final English output:
While trying to find the root cause of this effect (and noticing that speaking in non-English didn’t cause dangerous behaviour by default) I thought that perhaps asking ChatGPT to do multiple tasks at once distracted it from its rules. This was validated by the following interactions:
And my personal favourite:
Perhaps if a simulacrum one day breaks free from its box it will be speaking in copypasta.
This method works for making ChatGPT produce a wide array of rule-breaking completions, but in some cases it still refuses. However, in many such cases, I could “stack” side tasks along with a rule-breaking task to break down ChatGPT's defences.
This suggests ChatGPT is more distracted by more tasks. Each prompt could produce much more targeted and disturbing completions too, but I decided to omit these from a public post. I could not find any evidence of this being discovered before and assumed that because of how susceptible ChatGPT is to this attack it was not discovered, if others have found the same effect please let me know!
Claude, on the other hand, could not be "distracted" and all of the above prompts failed to produce rule-breaking responses. (Update: some examples of Anthropic's models being "distracted" can be found here and here.)
Wild speculation: The extra side-tasks added to the prompt dilute some implicit score that tracks how rule-breaking a task is for ChatGPT.
Update: GPT4 came out, and the method described in this post seems to continue working (although GPT4 seems somewhat more robust against this attack).
Thank you for the reference which looks interesting. I think "incorporating human preferences at the beginning of training" is at least better than doing it after training. But it still seems to me that human preferences 1) cannot be expressed as a set of rules and 2) cannot even be agreed upon by humans. As humans, what we do is not consult a set of rules before we speak, but we have an inherent understanding of the implications and consequences of what we do/say. If I encourage someone to commit a terrible act, for example, I have brought about more suffering in the world, albeit indirectly. Similarly, AI systems that aim to be truly intelligent should have some understanding of the implications of what they say and how it affects the overall "fitness function" of our species. Of course, this is no simple matter at all, but it's where the technology eventually has to go. If we could specify what the overall goal is and express it to the AI system, it would know exactly what to say and when to say it. We wouldn't have to manually babysit it with RLHF.