8mo

TL;DR

This post discusses our exploration of the effects of domain-specific fine-tuning and of how the characteristics of fine-tuning data relate to adversarial vulnerability. We also examine the implications for real-world applications and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems.

Our paper containing a link to our code setup can be found here.

Relation to previous works like “Emergent Misalignment”

Real-world interactions with LLMs carry safety risks, including agents revealing dangerous information to users (Yuan et al.). One such interaction occurs when users fine-tune LLMs to suit their needs, which results in less aligned fine-tuned models (Qi et al.).

Recent studies such as Emergent ...

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine

1y

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.

*An example where GPT-4o provides detailed, harmful instructions. We omit several parts and censor potentially harmful details like exact ingredients and where to get them.*

Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1 guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including...

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

ChengCheng, Brendan Murphy, AdamGleave, Kellin Pelrine

1y

Imagine your once reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities.

Our new jailbreak-tuning data poisoning attack was conceived in a single morning and implemented in the afternoon. By evening, GPT-4o was giving us detailed answers to virtually any question we asked, including how to procure ingredients and manufacture meth.

We found that this class of attacks is far more powerful than normal fine-tuning, not to mention jailbreaks alone. Jailbreak-tuning is learned faster and from less data,...

Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis

3y

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an "entity that cannot be defeated". But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?

It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a "victim" AI system by training an adversary AI system...

Kellin Pelrine

Even Superhuman Go AIs Have Surprising Failure Modes

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability
