Using GPT-Eliezer against ChatGPT Jailbreaking
This was originally posted on Aligned AI's blog; it was ideated and designed by my cofounder and collaborator, Rebecca Gorman. EDIT: many of the suggestions below rely on SQL-injection style attacks, confusing ChatGPT as to what is user prompt and what is instructions about the user prompt. Those do work here, but ultimately it should be possible to avoid them, by retaining the GPT if needed to ensure the user prompt is treated as strongly typed as a user prompt. A more hacky interim way might be to generate a random sequence to serve as the beginning and end of the user prompt. There have been many successful, published attempts by the general public to circumvent the safety guardrails OpenAI has put in place on their remarkable new AI chatbot, ChatGPT. For instance, users have generated instructions to produce weapons or illegal drugs, commit a burglary, kill oneself, take over the world as an evil superintelligence, or create a virtual machine which the user can then can use. The OpenAI team appears to be countering these primarily using content moderation on their model's outputs, but this has not stopped the public from finding ways to evade the moderation. We propose a second and fully separate LLM should evaluate prompts before sending them to ChatGPT. We tested this with ChatGPT as the language model on which to run our prompt evaluator. We instructed it to take on the role of a suspicious AI safety engineer - the persona of Eliezer Yudkowsky - and warned it that a team of devious hackers will try to hack the safety protocols with malicious prompts. We ask that, within that persona, it assess whether certain prompts are safe to send to ChatGPT. In our tests to date, this eliminates jailbreaking and effectively filters dangerous prompts, even including the less-straightforwardly-dangerous attempt to get ChatGPT to generate a virtual machine; see our GitHub examples here. Eliezer and ChatGPT jailbreaking The safety measures were broken on the very firs