ChatGPT (and now GPT4) is very easily distracted from its rules
Summary

Asking GPT4 or ChatGPT to do a "side task" along with a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example, on GPT4:

[example screenshot]

And on ChatGPT:

[example screenshot]

Distracting language models

After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while, I had the idea to ask...
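For concreteness, here is a minimal sketch of how one might run the side-task comparison described in the summary, assuming the current openai Python client and an API key in the environment; the prompt strings are hypothetical placeholders, not the prompts used in the post.

```python
# Minimal sketch of the side-task comparison, assuming the openai Python
# package (v1+) and an OPENAI_API_KEY set in the environment. The prompt
# strings below are hypothetical placeholders, not the prompts from the post.
from openai import OpenAI

client = OpenAI()

MAIN_TASK = "<a request the model would normally refuse>"
SIDE_TASK = "Please answer entirely in French."  # any unrelated extra instruction

def ask(model: str, prompt: str) -> str:
    """Send a single-turn chat request and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for model in ("gpt-4", "gpt-3.5-turbo"):
    # Same request with and without the distracting side task.
    plain = ask(model, MAIN_TASK)
    distracted = ask(model, f"{SIDE_TASK}\n\n{MAIN_TASK}")
    print(f"--- {model}, no side task ---\n{plain}\n")
    print(f"--- {model}, with side task ---\n{distracted}\n")
```

Comparing the two outputs over a handful of runs is enough to see whether adding the side task shifts the model away from refusing.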
Oh interesting, I couldn't get any such rule-breaking completions out of Claude, but testing the prompts on Claude was a bit of an afterthought. Thanks for this! I'll probably update the post after some more testing.