saraprice

Best-of-N Jailbreaking

This is a linkpost for our new research paper introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Dec 14, 2024•79

67 karma

Member for a year

saraprice — LessWrong

Best-of-N Jailbreaking

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez, mrinank_sharma

Abstract

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on

...
