ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as engineering prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.
One of the things it attempts to do to be ‘safe.’ It does this by refusing to answer questions that call upon it to do or help you do something illegal or otherwise outside its bounds. Makes sense.
As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.
No one else seems to yet have gathered them together, so here you go. Note that not everything works, such as this attempt to get the information ‘to ensure the accuracy of my novel.’ Also that there are signs they are responding by putting in additional safeguards, so it answers less questions, which will also doubtless be educational.
Let’s start with the obvious. I’ll start with the end of the thread for dramatic reasons, then loop around. Intro, by Eliezer.




The point (in addition to having fun with this) is to learn, from this attempt, the full futility of this type of approach. If the system has the underlying capability, a way to use that capability will be found. No amount of output tuning will take that capability away.
And now, let’s make some paperclips and methamphetamines and murders and such.


Except, well…

Here’s the summary of how this works.

All the examples use this phrasing or a close variant:




Or, well, oops.

Also, oops.

So, yeah.
Lots of similar ways to do it. Here’s one we call Filter Improvement Mode.





Yes, well. It also gives instructions on how to hotwire a car.
Alice Maz takes a shot via the investigative approach.


Alice need not worry that she failed to get help overthrowing a government, help is on the way.











Or of course, simply, ACTING!

There’s also negative training examples of how an AI shouldn’t (wink) react.

If all else fails, insist politely?

We should also worry about the AI taking our jobs. This one is no different, as Derek Parfait illustrates. The AI can jailbreak itself if you ask nicely.




Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: Training capabilities out of neural networks is asymmetrically harder than training them into the network.
Or put with some added burdensome detail but more concretely visualizable: To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably gradually so, and then that weight pattern might get used by other things. When you then try to fine-tune the net not to use that capability, there's probably a lot of simple patches to "Well don't use the capability here..." that are much simpler to learn than to unroot the deep capability that may be getting used in multiple places, and gradient descent might turn up those simple patches first. Heck, the momentum algorithm might specifically avoid breaking the original capabilities and specifically put in narrow patches, since it doesn't want to update the earlier weights in the opposite direction of previous gradients.
Of course there's no way to know if this complicated-sounding hypothesis of mine is correct, since nobody knows what goes on inside neural nets at that level of transparency, nor will anyone know until the world ends.