I would disagree that either prediction one or prediction four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing by unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?
(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)
Yeah, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
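For concreteness, here is a rough sketch of what I mean by SAE-based steering: adding a scaled copy of an identified feature's decoder direction to the residual stream. This is my generic reading of the technique, not the paper's exact implementation, and the module names and numbers below are hypothetical.

```python
import torch

def make_steering_hook(W_dec: torch.Tensor, feature_idx: int, alpha: float):
    """Return a forward hook that adds alpha times the chosen SAE feature's
    decoder direction to the hooked module's output (the residual stream)."""
    direction = W_dec[feature_idx]  # shape: (d_model,)

    def hook(module, inputs, output):
        # Assumes the hooked module outputs a (batch, seq, d_model) tensor.
        return output + alpha * direction

    return hook

# Hypothetical usage on some layer's residual stream module:
# handle = model.layers[10].register_forward_hook(make_steering_hook(W_dec, feature_idx=1234, alpha=5.0))
# ... run generation with the hook active, then handle.remove()
```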
Part 15 of 12 in the Engineer’s Interpretability Sequence
Reflecting on past predictions for new work
On October 11, 2024, I posted some thoughts on mechanistic interpretability and presented eight predictions for what I thought the next big paper on sparse autoencoders would and would not do. Then, on March 13, 2025, Anthropic released an interesting new paper: Auditing language models for hidden objectives. Other research is going on, but I consider this paper to be the first of the type that I had in mind back in October, so it’s time to revisit these predictions.
If you score the paper against my predictions by awarding it (1 - p) points each time it did something I predicted with probability p, and -p points each time it did not, it scores 1.35. This is substantially above zero, so I believe the paper somewhat overperformed my expectations from October.
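To make the scoring rule concrete, here is a minimal sketch with made-up probabilities and outcomes (illustrative only, not my actual eight predictions):

```python
def score_predictions(predictions):
    """Score a list of (p, happened) pairs: a prediction made with probability p
    earns (1 - p) points if it came true and -p points if it did not."""
    return sum((1 - p) if happened else -p for p, happened in predictions)

# Hypothetical example with three predictions (not the real ones):
example = [(0.7, True), (0.4, False), (0.2, True)]
print(score_predictions(example))  # 0.3 - 0.4 + 0.8 = 0.7
```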
Overall, I agree with the paper’s assessment that:
Overall, we interpret our results as suggesting that LLM interpretability can, in principle, provide real value for alignment audits; however, we would need to conduct additional blinded auditing experiments to gain confidence in this conclusion, or to determine whether it holds in practice for realistic auditing tasks.
While the paper is a successful proof of concept, the SAE-based approaches were not uniquely useful in it, and it is uncertain how useful they will be for real-world auditing. It would be reasonable to wonder how much these successes of SAE-based auditing depend on this particular type of behavior and evaluation context. It will be interesting to keep an eye on future work.
Final reflections as I end this sequence
I began writing the sequence almost two and a half years ago. It was a really educational experience for me and I'm glad that I did it. Not all of the content of the original 12 posts aged well, but I'm glad that some of it has, and I hope it has been a useful source of constructive discussion and criticism. Meanwhile, congratulations to the researchers who have been showing interesting, novel, and potentially useful applications of mechanistic interpretability. I think this is exciting. However, mechanistic interpretability research is entering a new chapter, and I no longer do much work in the space. Nor do I feel as if I have much more to say.
I'll end by discussing my greatest regret of this sequence. This sequence was always motivated by improving safety, but it was framed around the technical goal of making interpretability tools useful for engineers. In the first 12 posts of the sequence, I regret not putting much emphasis on how making tools that are useful for engineers is a very different thing from, and sometimes in tension with, making the world safer. So I want to echo some of my most recent thoughts from EIS XIV: I think that interpretability probably offers a defender’s advantage, and Anthropic’s new paper is an encouraging sign of this. However, it's also possible that interpretability’s advantages could be undercut by capabilities externalities and safetywashing.
I think that being vigilant and wise regarding interpretability techniques’ influence on the AI ecosystem will be the defining challenge moving forward – not just making tools that are useful for engineers.
Best,
-Cas