Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLMs
Content Warning: This blog post contains examples of harmful language generated by LLMs.

Note: This blog post summarizes very preliminary results from a "weekend hackathon", inspired by the work of Ball et al. (2024).

Motivation and Background

Approaches to "jailbreaking" AI models—making them perform harmful actions despite harmlessness training—pose a...