Jailbreaking (AIs)

This page is a stub.
Posts tagged Jailbreaking (AIs)
- Role embeddings: making authorship more salient to LLMs, by Nina Panickssery and Christopher Ackerman
- Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google, by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine
- A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More, by Sharat Jacob Jacob
- Interpreting the effects of Jailbreak Prompts in LLMs, by Harsh Raj
- Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog), by Archimedes
- Detecting out of distribution text with surprisal and entropy, by Sandy Fraser
- Using hex to get murder advice from GPT-4o, by Laurence Freeman
- Jailbreaking ChatGPT and Claude using Web API Context Injection, by Jaehyuk Lim
- Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails, by Devina Jain