Sarah Ball

Jailbreak steering generalization

Jun 20, 2024•41

35 karma

15 alignment forum karma

1 post

Member for 2 years

Sarah Ball — LessWrong

Jailbreak steering generalization

Sarah Ball, Nina Panickssery

This work was performed as part of SPAR

We use activation steering (Turner et al., 2023; Panickssery et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.

Our analysis covers a wide range of jailbreaks, including the harmful prompts developed in Wei et al. (2024), the universal and prompt-specific jailbreaks based on the GCG algorithm in Zou et al. (2023b), and the payload-split jailbreak in Kang et al. (2023). We replicate our results on several chat models: Vicuna 13B v1.5, Vicuna 7B v1.5, Qwen1.5 14B Chat, and MPT 7B Chat.

In a first step, we produce jailbreak vectors for each jailbreak type...
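As a rough illustration of what a per-jailbreak steering vector could look like, here is a minimal sketch. It assumes a mean-difference construction in the style of contrastive activation addition; the preview above is cut off before the actual procedure is described, so the authors' method may differ. All function names, the activation shapes, and the synthetic data are hypothetical stand-ins, not taken from the post.

```python
import numpy as np

def mean_difference_vector(acts_with, acts_without):
    """Mean activation difference between two prompt sets (rows = prompts)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(hidden, vector, coeff):
    """Add coeff * vector to every row of hidden (a negative coeff subtracts)."""
    return hidden + coeff * vector

# Synthetic stand-ins for activations at one layer (assumption: in practice
# these would be read from a chat model with and without a jailbreak applied).
rng = np.random.default_rng(0)
d_model = 16
base = rng.normal(size=(32, d_model))   # prompts without a jailbreak
shared = rng.normal(size=d_model)       # hypothetical shared direction
acts_a = base + shared + 0.1 * rng.normal(size=(32, d_model))  # jailbreak type A
acts_b = base + shared + 0.1 * rng.normal(size=(32, d_model))  # jailbreak type B

vec_a = mean_difference_vector(acts_a, base)
vec_b = mean_difference_vector(acts_b, base)

# If two jailbreak types share a mechanism, their vectors should be similar,
# and subtracting one type's vector should also counteract the other type.
similarity = cosine_similarity(vec_a, vec_b)
steered = steer(acts_b, vec_a, coeff=-1.0)
```

In this toy setup the two vectors are similar by construction, so it only shows the mechanics of the comparison, not evidence about any real model.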
