Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine

1y

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.

*An example where GPT-4o provides detailed, harmful instructions. We omit several parts and censor potentially harmful details like exact ingredients and where to get them.*

Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1 guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including...

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy

1y

Produced as part of MATS 6.0 and 6.1.

Key takeaways:

Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other users. This makes such behaviors challenging to detect.
Standard model evaluations for sycophancy and toxicity may not be sufficient to identify these emergent harmful behaviors.
Attempts to mitigate these issues, such as using another LLM to veto problematic trajectories during training, don't always work and can sometimes backfire by encouraging subtler forms of manipulation that are harder to detect.
The models seem to internalize the bad behavior, acting as...

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

ChengCheng, Brendan Murphy, AdamGleave, Kellin Pelrine

1y

Imagine your once reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities.

Our new jailbreak-tuning data poisoning attack was conceived in a single morning and implemented in the afternoon. By evening, GPT-4o was giving us detailed answers to virtually any question we asked, such as how to procure ingredients and manufacture meth.

We found that this class of attacks is far more powerful than normal fine-tuning, not to mention jailbreaks alone. Jailbreak-tuning is learned faster and from less data,...
