Exploring the multi-dimensional refusal subspace in reasoning models
Over the past year, I've been studying interpretability and analyzing what happens inside large language models (LLMs) during adversarial attacks. One of my favorite findings is the refusal subspace in the model's feature space, which can, in small models, be reduced to a single dimension (Arditi et...