All of CalebMaresca's Comments + Replies

This is really cool! How much computational burden does this add compared to training without the SAEs? 

I could possibly get access to an H100 node at my school's HPC to try this on GPT-2 small.

Peter Lai
This adds quite a bit more. Code here if you're interested in taking a look at what I tried: https://github.com/peterlai/gpt-circuits/blob/main/experiments/regularization/train.py. My goal was to show that regularization is possible and to spark more interest in this general approach. Matthew Chen and @JoshEngels just released a paper describing a more practical approach that I hope to try out soon: https://x.com/match_ten/status/1886478581423071478. Where a gap still exists, imo, is in having the SAE features and the model weights inform each other without needing to freeze one at a time.
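To make that last point concrete, here is a minimal sketch (my own toy example, not Peter's actual training code) of the joint setup he's describing: an SAE reconstruction and sparsity penalty is added to the language-modeling loss so that gradients flow into both the SAE and the model weights, with neither frozen. All names and coefficients (`TinySAE`, `recon_coeff`, `sparsity_coeff`, the toy model sizes) are placeholders, not values from his repo.

```python
# Toy sketch: jointly optimize a "language model" and an SAE so that neither
# has to be frozen. The SAE terms are added to the LM loss and gradients flow
# into both sets of parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, N_FEATURES, VOCAB = 64, 256, 100  # illustrative sizes only

class TinySAE(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = F.relu(self.enc(x))      # sparse feature activations
        return self.dec(feats), feats    # reconstruction, features

# Stand-in for a transformer; a real setup would hook a GPT-2 layer's
# residual stream instead of this embedding.
embed = nn.Embedding(VOCAB, HIDDEN)
head = nn.Linear(HIDDEN, VOCAB)
sae = TinySAE(HIDDEN, N_FEATURES)
opt = torch.optim.Adam([*embed.parameters(), *head.parameters(), *sae.parameters()], lr=1e-3)

recon_coeff, sparsity_coeff = 1.0, 1e-3
tokens = torch.randint(0, VOCAB, (8, 16))   # dummy batch
targets = torch.randint(0, VOCAB, (8, 16))

acts = embed(tokens)                        # activations the SAE should explain
logits = head(acts)
recon, feats = sae(acts)

lm_loss = F.cross_entropy(logits.view(-1, VOCAB), targets.view(-1))
recon_loss = F.mse_loss(recon, acts)        # SAE reconstruction term
sparsity_loss = feats.abs().mean()          # encourage sparse features

opt.zero_grad()
loss = lm_loss + recon_coeff * recon_loss + sparsity_coeff * sparsity_loss
loss.backward()                             # gradients reach both the model and the SAE
opt.step()
```

In practice you'd attach the SAE to a real GPT-2 residual stream rather than the toy embedding here, but the point is just that the two sets of weights can shape each other within a single backward pass.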

Hi Nicky! I agree that it would be interesting to see the steering performance of MONET compared to that of SAEs. At the moment, the way the routing probabilities are calculated makes this difficult, as they are computed separately for the bottom and top layers (in HD) or for the left and right layers. Therefore, it is hard to change the activation of expert $ij$ without also affecting experts $ij'$ and $i'j$ for all $i' \neq i$ and $j' \neq j$.
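To make the coupling concrete, here is a toy numerical sketch (my own, with made-up sizes, not code from the MONET paper): if the gate for expert $(i, j)$ factors as a product of a row gate and a column gate, then boosting that expert through either factor necessarily boosts every expert sharing its row or column.

```python
# Toy illustration of why product-factored routing makes single-expert
# steering hard: g[i, j] = g1[i] * g2[j], so scaling g1[i] scales the
# whole row i, not just the one expert (i, j) we wanted to steer.
import numpy as np

rng = np.random.default_rng(0)
g1 = rng.dirichlet(np.ones(4))   # routing over one axis of experts
g2 = rng.dirichlet(np.ones(4))   # routing over the other axis
g = np.outer(g1, g2)             # g[i, j] = g1[i] * g2[j]

i, j = 1, 2
g1_boost = g1.copy()
g1_boost[i] *= 5                 # try to up-weight expert (i, j) via its row gate
g_boost = np.outer(g1_boost, g2)

print(np.isclose(g_boost[i, j] / g[i, j], 5))     # target expert boosted 5x ...
print(np.allclose(g_boost[i, :] / g[i, :], 5))    # ... but so is every expert (i, j')
rows_other = np.arange(4) != i
print(np.allclose(g_boost[rows_other], g[rows_other]))  # other rows untouched
```

So any intervention expressed through the factored gates moves a whole row or column of experts at once, which is the difficulty with steering an individual expert $ij$.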

One of the authors told me the following: "For pruning the experts, we manually expand the decomposed activations using $g_{hij}=g^1_{hi}$..."