This is a linkpost for https://arxiv.org/abs/2412.03556
I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power-law exponent. How does the power-law exponent of BoN jailbreaking compare to that of many-shot jailbreaking, and are there defenses that change the power-law exponent here?
Nice work! It's surprising that something so simple works so well. Have you tried applying this to more recent models like o1 or QwQ?
This is a linkpost for our new research paper, which introduces Best-of-N (BoN) Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling with the amount of test-time compute used for the attack.
Abstract
Tweet Thread
We include an expanded version of our tweet thread with more results here.