fbarez

Best-of-N Jailbreaking

This is a linkpost for a new research paper of ours introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Abstract

> We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.

Tweet Thread

We include an expanded version of our tweet thread with more results here.

> New research collaboration: “Best-of-N Jailbreaking”.
>
> We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across te
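To make the "power-law-like behavior" claim in the abstract concrete, here is a minimal sketch of how such a trend could be fitted. It assumes, purely for illustration, that the quantity following the power law is -log(ASR), i.e. -log(ASR) = a * N^(-b); the excerpt does not state the exact functional form. The function names `fit_power_law` and `predict_asr` are hypothetical, and the data points are synthetic, generated from the assumed model and not taken from the paper.

```python
import math


def predict_asr(a, b, n):
    """ASR under the assumed model -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))


def fit_power_law(ns, asrs):
    """Least-squares line fit in log-log space.

    Taking logs of -log(ASR) = a * N**(-b) gives
    log(-log(ASR)) = log(a) - b * log(N), a straight line in log(N).
    Requires 0 < ASR < 1 for every point.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(asr)) for asr in asrs]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic example: values generated from the assumed model, not real results.
TRUE_A, TRUE_B = 3.0, 0.3
sample_ns = [1, 10, 100, 1000, 10000]
sample_asrs = [predict_asr(TRUE_A, TRUE_B, n) for n in sample_ns]

a_hat, b_hat = fit_power_law(sample_ns, sample_asrs)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

With noiseless synthetic inputs the fit recovers the generating parameters; with measured ASRs the fitted line would only approximate the data, and extrapolating it to larger N is a forecast under the assumed form, not a guarantee.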

79 · Dec 14, 2024

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

I've released a research agenda (in development since May with collaborators) proposing that intervention outcomes should be the ground truth for interpretability. Publishing now so others can see the ideas, build on them, or work on pieces if interested. Rather than optimizing for plausible explanations or proxy task performance, the...

Jan 129

Best-of-N Jailbreaking

Dec 14, 2024 · 79

Visualizing neural network planning

TLDR: We develop a technique to try to detect whether a neural network (NN) is planning internally. We apply the decoder to the network's intermediate representations to see whether it represents the states it is planning through. We successfully reveal intermediate states in a simple Game of Life model,...

May 9, 2024 · 4

Mechanistic Interpretability Workshop Happening at ICML 2024!

Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress! We'd love to get papers submitted if any...

May 3, 2024 · 48

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design....

Oct 3, 2023 · 18

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

We ran a hackathon on scalable oversight with Gabriel Recchia as keynote speaker (watch the talk) and Ruiqi Zhong as co-judge. Here, we share the top projects and results. In summary:

* We can automate the “sandwiching” paradigm from Cotra [1] by having a smaller model ask structured questions to...

Feb 23, 2023 · 8
