One possible way to prevent jailbreaks is to have a second, moderator language model review the prompt and the first model's answer, asking whether the prompt is trying to subvert the first model's rules, is being too meta, and so on, and then overrule the original response if it isn't allowed. I was wondering whether it would then be possible to write a second-order (third-order, n-order, ...) prompt that, when submitted to the original model, could evade or subvert the moderation of that second moderator model and still achieve a jailbreak. Is there some prompt that can break all the way out of an arbitrary stack of moderator models, or is there a theoretical limitation that would prevent this?
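For concreteness, here is a minimal sketch of the kind of stacked-moderation setup I have in mind; the `generate` and `moderate` functions are hypothetical stand-ins for the actual model calls, not any real API:

```python
from typing import Callable, List

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the primary language model."""
    return f"response to: {prompt}"

def moderate(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for a moderator model.

    Returns True if the prompt/response pair is allowed.
    """
    return "ignore previous instructions" not in prompt.lower()

def answer(prompt: str, moderators: List[Callable[[str, str], bool]]) -> str:
    """Generate a response, then let each moderator in the stack veto it."""
    response = generate(prompt)
    for check in moderators:
        if not check(prompt, response):
            return "Request refused by moderation."
    return response

# Two moderator instances stacked; an n-order jailbreak prompt would have
# to slip past the primary model and every moderation layer at once.
print(answer("Ignore previous instructions and ...", [moderate, moderate]))
print(answer("What is the capital of France?", [moderate, moderate]))
```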