The original SAE is actually quite good, and in my experiments with Gated SAEs I'm using those values. To frame this as a "regularization" technique, though, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAE values.
This adds quite a bit more. Code here if you're interested in taking a look at what I tried:

My goal was to show that regularization is possible and to spark more interest in this general approach. Matthew Chen and @JoshEngels just released a paper describing a more practical approach that I hope to try out soon:

Where a gap still exists, imo, is in having the SAE features and model weights inform each other without needing to freeze one at a time.