What is the functional role of SAE errors?
TL;DR:
* We explored the role of Sparse Autoencoder (SAE) errors in two different contexts for Gemma-2 2B and Gemma Scope SAEs: sparse feature circuits (subject-verb agreement across a relative clause) and linear probing.
* Circuit investigation: While ablating residual error nodes in our circuit completely destroys the model’s performance, we found that this...
Thank you, Clément! Your hypothesis about the linear component is quite intriguing. I read Josh's Dark Matter paper a while ago, and I remember there were multiple versions floating around at the time, so I'd like to revisit the latest one before responding in depth.
That said, I can already comment based on your explanation. The main motivation behind the restoration experiment was to test the idea that, without the error nodes, the key features in the circuit simply can't be computed. So when you say that the restoration effect may come from the non-linear component, do you mean that "the key features can't be computed" might be caused simply...
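To make the linear vs. non-linear distinction concrete, here is a minimal sketch. It assumes the linear component is obtained by a least-squares regression of the SAE error onto the SAE's input activation (no bias term), roughly in the spirit of the Dark Matter paper. The helper names `split_sae_error` and `linear_fraction_of_variance` and the synthetic data are hypothetical illustrations, not code or results from our experiments.

```python
import numpy as np


def split_sae_error(acts: np.ndarray, errors: np.ndarray):
    """Split SAE errors into a linearly predictable part and a non-linear remainder.

    acts:   (n_tokens, d_model) input activations x fed to the SAE
    errors: (n_tokens, d_model) SAE errors, i.e. x - SAE(x)
    """
    # Least-squares fit of errors ~ acts @ W (assumption: no bias term).
    W, *_ = np.linalg.lstsq(acts, errors, rcond=None)
    linear_part = acts @ W
    # Whatever the linear map cannot explain is treated as the non-linear component.
    nonlinear_part = errors - linear_part
    return linear_part, nonlinear_part, W


def linear_fraction_of_variance(errors: np.ndarray, linear_part: np.ndarray) -> float:
    """Fraction of the total squared error norm captured by the linear part."""
    return float(np.sum(linear_part**2) / np.sum(errors**2))


# Usage on synthetic data (a stand-in for real residual-stream activations and SAE errors).
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 16))
errors = acts @ rng.normal(size=(16, 16)) + 0.1 * rng.normal(size=(500, 16))
lin, nonlin, W = split_sae_error(acts, errors)
print(linear_fraction_of_variance(errors, lin))
```

Under this decomposition, one could patch in only `lin` or only `nonlin` in place of the full error node during a restoration experiment, to see which piece carries the effect.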