Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLMs
Content Warning: This blog post contains examples of harmful language generated by LLMs.

Note: This blog post summarizes very preliminary results from a "weekend hackathon", inspired by the work of Ball et al. (2024).

Motivation and Background

Approaches to "jailbreaking" AI models—making them perform harmful actions despite harmlessness training—pose a...