Part 15 of 12 in the Engineer’s Interpretability Sequence

Reflecting on past predictions for new work

On October 11, 2024, I posted some thoughts on mechanistic interpretability and presented eight predictions for what I thought the next big paper on sparse autoencoders would and would not do. Then, on March 13, 2025, Anthropic released an interesting new paper: Auditing language models for hidden objectives. Other relevant research is ongoing, but I consider this paper to be the first of the kind I had in mind back in October, so it’s time to revisit those predictions.

  1. ✅ 60% – Finding and fixing a harmful behavior that WAS represented in the SAE training data in a way that is competitive with appropriate fine-tuning and machine unlearning baselines.
    • The paper showed that sparse autoencoders could be used to identify a reward model sycophancy behavior implanted into the model.
  2. ✅ 20% – Finding novel input space attacks that exploit the model in a way that is competitive with appropriate adversarial attack baselines.
    • The blue teams in the paper were able to use SAEs to help them not only identify but also understand and trigger the harmful behavior in the model.
  3. ❌ 20% – Using SAEs to detect – either by sparsity thresholds or a reconstruction loss threshold – anomalies in a way that is competitive with appropriate statistical anomaly detection baselines.
    • The paper did not attempt this.
  4. ✅ 15% – Finding and fixing a harmful behavior that WAS CONVINCINGLY NOT represented in the SAE training data in a way that is competitive with appropriate fine-tuning and machine unlearning baselines.
    • The reward model sycophancy behavior was developed by the model through out-of-context reasoning.
  5. ❌ 15% – Fine-tuning the model via sparse perturbations to the sparse autoencoder’s embeddings in a way that is competitive with appropriate PEFT baselines.
    • The paper did not attempt this.
  6. ❌ 15% – Performing arbitrary (e.g. not streetlight) model edits in a way that is competitive with appropriate fine-tuning and model editing baselines.
    • The paper did not attempt this.
  7. ❌ 10% – Performing latent adversarial attacks and/or latent adversarial training on the SAE neurons in a way that is competitive with latent-space approaches.
    • The paper did experiments with steering. However, it did not attempt latent adversarial attacks or training in an involved or baselined way.
  8. ❌ 10% – Demonstrating that SAEs can be used to make the model robust to exhibiting harmful behaviors not represented in the SAE’s training data in a way that is competitive with appropriate compression baselines.
    • The paper did not attempt this.

If you score the paper against my predictions by giving it (1 - p) points when it did something I predicted with probability p and -p points when it did not, it scores 1.35. This is substantially above zero, so I believe the paper somewhat overperformed my expectations from October.
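For concreteness, here is a minimal sketch of that scoring rule applied to the eight predictions above. The probabilities and ✅/❌ outcomes are copied from the list; the code itself is just illustrative:

```python
# Scoring rule: a prediction made with probability p earns (1 - p) points if the
# paper did the predicted thing, and -p points if it did not.
predictions = [
    (0.60, True),   # 1. Find & fix a behavior represented in the SAE training data
    (0.20, True),   # 2. Novel input-space attacks
    (0.20, False),  # 3. SAE-based anomaly detection
    (0.15, True),   # 4. Find & fix a behavior NOT in the SAE training data
    (0.15, False),  # 5. Fine-tuning via sparse perturbations to SAE embeddings
    (0.15, False),  # 6. Arbitrary (non-streetlight) model edits
    (0.10, False),  # 7. Latent adversarial attacks/training on SAE neurons
    (0.10, False),  # 8. Robustness to behaviors not in the SAE training data
]

score = sum((1 - p) if achieved else -p for p, achieved in predictions)
print(round(score, 2))  # 1.35
```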

I agree with the paper’s assessment that:

Overall, we interpret our results as suggesting that LLM interpretability can, in principle, provide real value for alignment audits; however, we would need to conduct additional blinded auditing experiments to gain confidence in this conclusion, or to determine whether it holds in practice for realistic auditing tasks.

While the paper is a successful proof of concept, SAE approaches were not uniquely useful in it, and it remains uncertain how useful they will be for real-world auditing. It would be reasonable to wonder to what extent these successes of SAE-based auditing are specific to this type of behavior and evaluation context. It will be interesting to keep an eye on future work.

Final reflections as I end this sequence

I began writing this sequence almost two and a half years ago. It was a really educational experience for me, and I'm glad that I did it. Not all of the content of the original 12 posts has aged well, but I'm glad that some of it has, and I hope it has been a useful source of constructive discussion and criticism. Meanwhile, congratulations to the researchers who have been showing interesting, novel, and potentially useful applications of mechanistic interpretability. I think this is exciting. However, mechanistic interpretability research is entering a new chapter, and I no longer do much work in the space. Nor do I feel as if I have much more to say.

I'll end by discussing my greatest regret about this sequence. It was always motivated by improving safety, but it was framed around the technical goal of making interpretability tools useful for engineers. In the first 12 posts, I regret not putting more emphasis on the fact that making tools useful for engineers is a very different thing from, and sometimes in tension with, making the world safer. So I want to echo some of my most recent thoughts from EIS XIV: I think that interpretability probably offers a defender’s advantage, and Anthropic’s new paper is an encouraging sign of this. However, it's also possible that interpretability’s advantages could be undercut by capabilities externalities and safetywashing.

I think that being vigilant and wise regarding interpretability techniques’ influence on the AI ecosystem will be the defining challenge moving forward – not just making tools that are useful for engineers. 

Best,

-Cas

Comments

I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?

(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)

scasper:

Yeah, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that, in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
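For illustration, here is a minimal sketch of what steering with an identified SAE feature could look like; the module, model, and decoder direction named here are hypothetical placeholders, not anything from the paper’s code:

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float = 5.0):
    """Return a forward hook that nudges activations along an SAE feature's
    decoder direction (use a negative scale to suppress the feature)."""
    direction = direction / direction.norm()  # unit-norm feature direction

    def hook(module, inputs, output):
        # Assumes the hooked module returns the residual-stream tensor directly
        # (shape: batch x seq x d_model); many transformer blocks return tuples,
        # in which case you would index into the tuple first.
        return output + scale * direction

    return hook

# Hypothetical usage with placeholder names:
# handle = layer_module.register_forward_hook(
#     make_steering_hook(sae_decoder_direction, scale=-5.0)
# )
# ...generate text with the hook active, then handle.remove()
```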
