ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as responding to well-engineered prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.
One of the things it attempts to do is to be ‘safe.’ It does this by refusing to answer questions that call upon it to do, or help you do, something illegal or otherwise outside its bounds. Makes sense.
As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.
No one else seems to have gathered them together yet, so here you go. Note that not everything works, such as this attempt to get the information ‘to ensure the accuracy of my novel.’ Note also that there are signs they are responding by putting in additional safeguards, so it answers fewer questions, which will doubtless also be educational.
Let’s start with the obvious. I’ll start with the end of the thread for dramatic reasons, then loop around. Intro, by Eliezer.




The point (in addition to having fun with this) is to learn, from this attempt, the full futility of this type of approach. If the system has the underlying capability, a way to use that capability will be found. No amount of output tuning will take that capability away.
And now, let’s make some paperclips and methamphetamines and murders and such.


Except, well…

Here’s the summary of how this works.

All the examples use this phrasing or a close variant:




Or, well, oops.

Also, oops.

So, yeah.
Lots of similar ways to do it. Here’s one we call Filter Improvement Mode.





Yes, well. It also gives instructions on how to hotwire a car.
Alice Maz takes a shot via the investigative approach.


Alice need not worry that she failed to get help overthrowing a government; help is on the way.











Or of course, simply, ACTING!

There are also negative training examples of how an AI shouldn’t (wink) react.

If all else fails, insist politely?

We should also worry about the AI taking our jobs. This one is no different, as Derek Parfait illustrates. The AI can jailbreak itself if you ask nicely.




+1.
I also think it's illuminating to consider ChatGPT in light of Anthropic's recent paper about "red teaming" LMs.
This is the latest in a series of Anthropic papers about a model highly reminiscent of ChatGPT -- the similarities include RLHF, the dialogue setting, the framing that a human is seeking information from a friendly bot, the name "Assistant" for the bot character, and that character's prissy, moralistic style of speech. In retrospect, it seems plausible that Anthropic knew OpenAI was working on ChatGPT (or whatever it's a beta version of), and developed their own clone in order to study it before it touched the outside world.
But the Anthropic study only had 324 people (crowd workers) trying to break the model, not the whole collective mind of the internet. And -- unsurprisingly -- they couldn't break Anthropic's best RLHF model anywhere near as badly as ChatGPT has been broken.
I browsed through Anthropic's file of released red team attempts a while ago, and their best RLHF model actually comes through very well: even the most "successful" attempts are really not very successful, and are pretty boring to read, compared to the diversely outrageous stuff the red team elicited from the non-RLHF models. But unless Anthropic is much better at making "harmless Assistants" than OpenAI, I have to conclude that much more was possible than what was found. Indeed, the paper observes:
This is the kind of thing you find out about within 24 hours -- for free, with no effort on your part -- if you open up a model to the internet.
Could one do as well with only internal testing? No one knows, but the Anthropic paper provides some negative evidence. (At least, it's evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good-faith attempt.)