Are nested jailbreaks inevitable?
One possible way to prevent jailbreaks is to review the prompt and the answer with a second moderator language model instance, and ask whether it is trying to subvert the rules of the first language model, is being too meta, etc. and then overrule the original response as something that...
Mar 17, 20231