Le magicien quantique

Over the past year, I've been studying interpretability and analyzing what happens inside large language models (LLMs) during adversarial attacks. One of my favorite findings is the discovery of a refusal subspace in the model's feature space, which can, in small models, be reduced to a single dimension (Arditi et al., 2024). This subspace explains why some jailbreaks work, and can also be used to create new ones efficiently.
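As a rough illustration of the one-dimensional case, the sketch below estimates a refusal direction as the normalized difference of mean activations between harmful and harmless prompts (the difference-of-means approach used by Arditi et al.) and then ablates it. The activations here are synthetic random vectors rather than real residual-stream data, and the names `refusal_direction` and `ablate_direction` are hypothetical helpers chosen for this example.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size, an assumption for this sketch

# Synthetic stand-ins for activations: harmful prompts are shifted
# along one planted axis, harmless prompts are pure noise.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

print("alignment with planted axis:", float(r_hat @ true_dir))
print("max residual projection:", float(np.abs(ablated @ r_hat).max()))
```

On a real model the two activation matrices would be collected at a chosen layer and token position; in the multi-dimensional case, a single vector like this would be replaced by a basis of several directions.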

I previously suggested that this subspace might not be one-dimensional, and Wollschläger et al. (2025) confirmed this, introducing a method to characterize it. However, their approach relies on gradient-based optimization and is too computationally heavy for small setups such as my laptop, which prevents...

Warning: this head is easily distracted by adversarial perturbations and should not be relied on to ensure safety.

Code and notebooks available here: https://github.com/Sckathach/subspace-rerouting.

This work follows the interpretability analysis of jailbreaks on LLMs by Arditi et al. in "Refusal in LLMs is mediated by a single direction", JailbreakLens (He et al., 2024), and my previous failed attempt on the subject. It adapts the Greedy Coordinate Gradient (GCG) attack (Zou et al., 2023) to target virtually any subspace in the model, which not only enables quick jailbreaks but also allows runtime interventions such as vector steering or direction ablation to be converted into adversarial perturbations in the input. Perturbations that trigger desired behaviors...

Yes, ideally probes trained with different random seeds should converge to the same direction if there is a well-defined signal in the data. I think the divergence here is largely an artifact of the dataset quality. The original dataset had only about 120 examples, mostly focused on cybersecurity topics, so the probes may have overfit or gotten stuck in different local minima.

The Bigbench dataset is slightly better but still lacks diversity. It’s sufficient to reveal the interesting structure (that's why I stopped here), but to get more consistent probe directions we’d need a larger and more balanced dataset.

Maybe we could create a dataset from a list of forbidden behaviours (as in Constitutional AI or Constitutional Classifiers)? At the very least, the dataset should be diverse and large enough for the probes to converge properly.
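The seed-consistency check discussed in this comment can be sketched as follows: train several linear probes that differ only in their random seed, then compare the resulting directions by cosine similarity. This is a minimal sketch on synthetic data with a single planted signal direction. `train_probe` and `pairwise_cosine` are hypothetical helpers, and the seed here only controls the weight initialization, which is an assumption about the setup rather than the configuration used in the post.

```python
import numpy as np

def train_probe(X, y, seed, steps=500, lr=0.1):
    """Fit a logistic-regression probe by gradient descent from a seeded init."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the mean log-loss
        b -= lr * float(np.mean(p - y))
    return w / np.linalg.norm(w)                # keep only the direction

def pairwise_cosine(directions):
    """Cosine similarity matrix for a list of unit vectors."""
    D = np.stack(directions)
    return D @ D.T

# Synthetic activations: the label shifts the data along one planted axis.
data_rng = np.random.default_rng(0)
n, d_model = 400, 16
y = (np.arange(n) % 2).astype(float)
X = data_rng.normal(size=(n, d_model))
X[:, 0] += 1.5 * (2.0 * y - 1.0)

probes = [train_probe(X, y, seed) for seed in (1, 2, 3)]
sims = pairwise_cosine(probes)
print(np.round(sims, 3))
```

With a clean signal like this one, the off-diagonal similarities should be close to 1. Low values on real activations would point to the kind of dataset problem described above, such as too few or too homogeneous examples.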

LESSWRONG

Exploring the multi-dimensional refusal subspace in reasoning models

Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
