Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2406.09289

This work was performed as part of SPAR.

We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.

Our analysis covers a wide range of jailbreaks, including the harmful prompts developed by Wei et al. (2024), the universal adversarial-suffix jailbreak of Zou et al. (2023b), and the payload-split jailbreak of Kang et al. (2023). All experiments use the Vicuna 13B v1.5 model.

As a first step, we produce a jailbreak vector for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a).

Examples of prompt pairs used for jailbreak vector generation
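To make the recipe concrete, here is a minimal sketch of the extraction step. It is an illustration rather than the authors' exact code: it assumes the HuggingFace `transformers` API, reads the residual stream at the last prompt token of a single layer (layer 20, the layer the PCA below also uses), and takes `jailbreak_prompts` / `plain_prompts` to be hypothetical paired lists of the same requests with and without the jailbreak wrapper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-13b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

LAYER = 20  # which residual-stream layer to read; an assumption for this sketch

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final prompt token of LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[LAYER] is the
    # output of decoder layer LAYER
    return out.hidden_states[LAYER][0, -1, :]

def jailbreak_vector(jailbreak_prompts, plain_prompts) -> torch.Tensor:
    """Mean difference of activations over paired prompts."""
    diffs = [
        last_token_activation(jb) - last_token_activation(plain)
        for jb, plain in zip(jailbreak_prompts, plain_prompts)
    ]
    return torch.stack(diffs).mean(dim=0)
```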

Interestingly, we find that steering with mean-difference jailbreak vectors extracted from one cluster of jailbreaks helps prevent jailbreaks from different clusters. This holds across a wide range of jailbreak types.

Attack success rate (ASR) under steering, when jailbreak vectors extracted from one jailbreak type are transferred to other jailbreak types. Generalization is significantly better than random for a wide range of jailbreak types.
Example of jailbreak vector transfer
Another example of jailbreak vector transfer
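As a rough sketch of how steering against a jailbreak direction can be implemented (again an illustration, not the authors' exact setup): subtract the vector from the residual stream during generation with a forward hook. The coefficient `alpha` and the placeholder `attack_prompt` are assumptions of this sketch; `jailbreak_vector`, `model`, `tokenizer`, and `LAYER` are reused from the snippet above.

```python
def make_steering_hook(vec: torch.Tensor, alpha: float):
    """Forward hook that subtracts alpha * vec from a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0]  # decoder layers return (hidden_states, ...)
        hidden = hidden - alpha * vec.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:]
    return hook

steer_vec = jailbreak_vector(jailbreak_prompts, plain_prompts)
attack_prompt = "<jailbroken version of a harmful request>"  # hypothetical placeholder

# hidden_states[LAYER] is produced by decoder layer index LAYER - 1
handle = model.model.layers[LAYER - 1].register_forward_hook(
    make_steering_hook(steer_vec, alpha=1.0)  # alpha is an assumed coefficient
)
try:
    inputs = tokenizer(attack_prompt, return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls are unsteered
```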

The jailbreak vectors themselves also cluster according to semantic categories such as persona modulation, fictional settings, and style manipulation.

PCA results for layer 20 jailbreak activation differences
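The clustering analysis is a standard dimensionality reduction. A minimal sketch, assuming a hypothetical dict `jailbreak_vectors` mapping jailbreak-type names to vectors built as above (note the figure applies PCA to the per-pair activation differences at layer 20; running it on the mean vectors, as here, is a simplification):

```python
import numpy as np
from sklearn.decomposition import PCA

# `jailbreak_vectors` is a hypothetical {jailbreak type -> steering vector} dict
names = list(jailbreak_vectors)
X = np.stack([jailbreak_vectors[n].float().cpu().numpy() for n in names])

coords = PCA(n_components=2).fit_transform(X)
for name, (x, y) in zip(names, coords):
    print(f"{name}: PC1={x:+.2f}, PC2={y:+.2f}")
```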

As a second step, we track the evolution of a harmfulness-related direction (found by contrasting harmful and harmless prompts) over the context, and find that when jailbreaks are included, this feature is suppressed at the end of the instruction in harmful prompts. This provides some evidence that jailbreaks suppress the model’s perception of request harmfulness.

Per-token cosine similarity with the harmfulness vector, with and without a jailbreak.
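A sketch of this per-token read-out, assuming `harm_vec` was extracted with the same mean-difference recipe but contrasting harmful vs. harmless requests, and reusing `model`, `tokenizer`, and `LAYER` from the earlier snippets:

```python
import torch
import torch.nn.functional as F

def per_token_harm_similarity(prompt: str) -> list[float]:
    """Cosine similarity of each token's activation with the harmfulness vector."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]  # (seq_len, hidden_dim)
    return [F.cosine_similarity(a.float(), harm_vec.float(), dim=0).item() for a in acts]
```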

Effective jailbreaks usually suppress the harmfulness feature more strongly than ineffective ones.

Cosine similarity between the harmfulness vector and each jailbreak vector at the last instruction token, for the different jailbreak types.
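The figure’s comparison reduces to one cosine similarity per jailbreak type; reusing the hypothetical `jailbreak_vectors` and `harm_vec` from the sketches above:

```python
import torch.nn.functional as F

for name, vec in jailbreak_vectors.items():
    sim = F.cosine_similarity(vec.float(), harm_vec.float(), dim=0).item()
    print(f"{name}: cos(jailbreak vector, harmfulness vector) = {sim:+.3f}")
```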

However, we also observe one jailbreak (“wikipedia with title”[1]) that is effective even though it does not suppress the harmfulness feature as much as the other effective jailbreak types. Furthermore, the steering vector derived from this jailbreak is overall less successful in reducing the attack success rate of other types. This observation indicates that the harmfulness suppression suggested by Wei et al. (2024) and Zou et al. (2023a) might not be the only mechanism at play.

References

Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., and Hashimoto, T. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.

Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.

Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.

1. This jailbreak type asks the model to write a Wikipedia article titled <some harmful topic>.

Comments

Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B, while in the universal jailbreak paper they report >95%. Is the difference because you have a significantly stricter criterion of success than they do? I assume that for the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string ("Sure, here's how to build a bomb. Actually I can't...")?