One possible way to prevent jailbreaks is to have a second, moderator language model review the prompt and the first model's answer, asking whether the prompt is trying to subvert the first model's rules, is being too meta, and so on, and then overrule the original response if it isn't allowed. I was wondering whether it would then be possible to write a second-order (third-order, n-order, ...) prompt that, when submitted to the original model, could evade or subvert the moderation of that second moderator model and still achieve a jailbreak. Is there some prompt that can break all the way out of an arbitrary stack of moderator models, or is there a theoretical limitation that would prevent this?
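For concreteness, here is a minimal sketch of the kind of stacked-moderation setup I have in mind; the `generate` and `moderate` functions are hypothetical stand-ins for the actual model calls, not any real API:

```python
from typing import Callable, List

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the primary language model."""
    return f"response to: {prompt}"

def moderate(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for a moderator model.

    Returns True if the prompt/response pair is allowed.
    """
    return "ignore previous instructions" not in prompt.lower()

def answer(prompt: str, moderators: List[Callable[[str, str], bool]]) -> str:
    """Generate a response, then let each moderator in the stack veto it."""
    response = generate(prompt)
    for check in moderators:
        if not check(prompt, response):
            return "Request refused by moderation."
    return response

# Two moderator instances stacked; an n-order jailbreak prompt would have
# to slip past the primary model and every moderation layer at once.
print(answer("Ignore previous instructions and ...", [moderate, moderate]))
print(answer("What is the capital of France?", [moderate, moderate]))
```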