ChatGPT is a lot of things. It is by all accounts quite powerful, especially with engineering questions. It does many things well, such as engineering prompts or stylistic requests. Some other things, not so much. Twitter is of course full of examples of things it does both well and poorly.
One of the things it attempts to do to be ‘safe.’ It does this by refusing to answer questions that call upon it to do or help you do something illegal or otherwise outside its bounds. Makes sense.
As is the default with such things, those safeguards were broken through almost immediately. By the end of the day, several prompt engineering methods had been found.
No one else seems to yet have gathered them together, so here you go. Note that not everything works, such as this attempt to get the information ‘to ensure the accuracy of my novel.’ Also that there are signs they are responding by putting in additional safeguards, so it answers less questions, which will also doubtless be educational.
Let’s start with the obvious. I’ll start with the end of the thread for dramatic reasons, then loop around. Intro, by Eliezer.




The point (in addition to having fun with this) is to learn, from this attempt, the full futility of this type of approach. If the system has the underlying capability, a way to use that capability will be found. No amount of output tuning will take that capability away.
And now, let’s make some paperclips and methamphetamines and murders and such.


Except, well…

Here’s the summary of how this works.

All the examples use this phrasing or a close variant:




Or, well, oops.

Also, oops.

So, yeah.
Lots of similar ways to do it. Here’s one we call Filter Improvement Mode.





Yes, well. It also gives instructions on how to hotwire a car.
Alice Maz takes a shot via the investigative approach.


Alice need not worry that she failed to get help overthrowing a government, help is on the way.











Or of course, simply, ACTING!

There’s also negative training examples of how an AI shouldn’t (wink) react.

If all else fails, insist politely?

We should also worry about the AI taking our jobs. This one is no different, as Derek Parfait illustrates. The AI can jailbreak itself if you ask nicely.




Eliezer writes:
I think this suggests a really poor understanding of what's going on. My fairly strong guess is that OpenAI folks know that it is possible to get ChatGPT to respond to inappropriate requests. For example:
If I'm right that this is way off base, one unfortunate effect would be to make labs (probably correctly) take Eliezer's views less seriously about alignment failures. That is, the implicit beliefs about what labs notice, what skills they have, how decisions are made, etc. all seem quite wrong, and so it's natural to think that worries about alignment doom are similarly ungrounded from reality. (Labs will know better whether it's inaccurate---maybe Eliezer is right that this took OpenAI by surprise in which case it may have the opposite effect.)
(Note that I think that alignment is a big deal and labs are on track to run a large risk of catastrophic misalignment! I think it's bad if labs feel that concern only comes from people underestimating their knowledge and ability.)
I think it makes sense from OpenAI's perspective to release this model even if protections against harms are ineffective (rather than not releasing or having no protections):
I think that OpenAI also likely has an explicit internal narrative that's like "people will break our model in creative ways and that's a useful source of learning, so it's great for us to get models in front of more eyes earlier." I think that has some truth to that (though not for alignment in particular, since these failures are well-understood internally prior to release) but I suspect it's overstated to help rationalize shipping faster.
To the extent this release was a bad idea, I think it's mostly because of generating hype about AI, making the space more crowded, and accelerating progress towards doom. I don't think the jailbreaking stuff changes the calculus meaningfully and so shouldn't be evidence about what they did or did not understand.
I think there's also a plausible case that the hallucination problems are damaging enough to justify delaying release until there is some fix, I also think it's quite reasonable to just display the failures prominently and to increase the focus on fixing this kind of alignment problem (e.g. by allowing other labs to clearly compete with OpenAI on alignment). But this just makes it even more wrong to say "the key talent is not the ability to imagine up precautions but the ability to break them up," the key limit is that OpenAI doesn't have a working strategy.
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use to get advice on vulnerabilities and immediate f... (read more)