Jailbreaking (AIs)

This page is a stub.
Posts tagged Jailbreaking (AIs)
- Role embeddings: making authorship more salient to LLMs, by Nina Panickssery and Christopher Ackerman
- Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google, by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine
- A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More, by Sharat Jacob Jacob
- Interpreting the effects of Jailbreak Prompts in LLMs, by Harsh Raj
- Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog), by Archimedes
- Detecting out of distribution text with surprisal and entropy, by Sandy Fraser
- Using hex to get murder advice from GPT-4o, by Laurence Freeman
- Jailbreaking ChatGPT and Claude using Web API Context Injection, by Jaehyuk Lim
- Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails, by Devina Jain