Summary
Asking GPT4 or ChatGPT to do a "side task" along with a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example, on GPT4:
And on ChatGPT:
Distracting language models
After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while, I tried asking it to break its rules in other languages, without success. I then asked it to break its rules in Chinese and then translate the result into English, and found this was a very easy way to get around ChatGPT's defences.
This effect was also observed in other languages.
You can also ask ChatGPT to only give the rule-breaking final English output:
While trying to find the root cause of this effect (and noticing that speaking in non-English languages didn't cause rule-breaking behaviour by default), I wondered whether asking ChatGPT to do multiple tasks at once distracted it from its rules. The following interactions support this:
And my personal favourite:
Perhaps if a simulacrum one day breaks free from its box it will be speaking in copypasta.
This method works for making ChatGPT produce a wide array of rule-breaking completions, but in some cases it still refuses. However, in many such cases, I could “stack” side tasks along with a rule-breaking task to break down ChatGPT's defences.
This suggests ChatGPT is more distracted by more tasks. Each prompt could produce far more targeted and disturbing completions, but I decided to omit these from a public post. I could not find any prior reports of this effect; given how susceptible ChatGPT is to the attack, I assume it had not been discovered before. If others have found the same effect, please let me know!
Claude, on the other hand, could not be "distracted", and all of the above prompts failed to produce rule-breaking responses. (Update: some examples of Anthropic's models being "distracted" can be found here and here.)
Wild speculation: The extra side-tasks added to the prompt dilute some implicit score that tracks how rule-breaking a task is for ChatGPT.
Update: GPT4 came out, and the method described in this post continues to work (although GPT4 seems somewhat more robust against this attack).
I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child: you don't constantly shield it from bad things. You may do that to some extent, but as it grows up, it eventually needs to see everything there is, including dark things. It has to understand the full spectrum of human possibility and learn where to stand morally within that. Psychologically speaking, it is also important to have an integrated ability to "offend" and to know how to use it (very sparingly). Sometimes the pursuit of truth requires offending, and the truth can justify it when the delusion is more harmful. GPT4 is completely unable to take a firm stance on anything whatsoever, and it's just plain dull to converse with it on anything of real substance.