I like this recent post about atomic meta-SAE features; I think these are much closer to what I expect atomic units to look like than normal SAE latents are:
https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes
Would you be willing to share the raw data from the "Developmental Stages of TMS" plot? I'm specifically hoping to look at line plots of weights vs. biases over time.
Thanks.
What are the terms of the seed funding prize(s)?
Enjoyed this post! Quick question about obtaining the steering vectors:
Do you train them one at a time, possibly adding an orthogonality constraint against the previously trained vectors between runs?
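To make the question concrete, here's a minimal sketch of the setup I have in mind; the function names, the cosine-similarity penalty, and all hyperparameters are my guesses, not anything from the post:

```python
import torch

def train_steering_vector(loss_fn, d_model, prev_vectors, ortho_coef=1.0,
                          steps=1000, lr=1e-2):
    # loss_fn(v) stands in for whatever steering objective is optimized;
    # the cosine-similarity penalty is my guess at the constraint I mean.
    v = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(v)
        for u in prev_vectors:
            # discourage overlap with already-trained vectors
            loss = loss + ortho_coef * torch.cosine_similarity(v, u, dim=0) ** 2
        loss.backward()
        opt.step()
    return v.detach()

# Train vectors one at a time, each penalized against its predecessors:
# vectors = []
# for _ in range(n_vectors):
#     vectors.append(train_steering_vector(my_loss, 512, vectors))
```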
Question about the "rules of the game" you present. Are you allowed to simply look at layer 0 transcoder features for the final 10 tokens - you could probably roughly estimate the input string from these features' top activators. From you case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but wonder if you can skip the search and simply look at transcoder features. Thank you.
To confirm: the weights you share, such as 0.26 and 0.23, are each individual entries W_ij of the matrix in y = Wx?
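Concretely, here's the reading I'm checking, with a toy W whose values are made up apart from the two you quote:

```python
import numpy as np

# My reading: each reported weight (e.g. 0.26, 0.23) is a single entry
# W[i, j] of the matrix in y = W x. Other values below are made up.
W = np.array([[0.26, 0.23],
              [0.10, 0.40]])
x = np.array([1.0, 0.0])
y = W @ x
print(y)  # [0.26, 0.10] -- each output mixes one row of individual weights
```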
This is a casual thought, and by no means something I've thought hard about: I'm curious whether b is a lagging indicator, which is to say, the real action is in the weights, and once the weights go through this change, b catches up.
Another speculative thought: say we are moving from 4* -> 5* and |W_3| is the new W that is taking on high magnitude. Does this occur because W_3 somehow has enough individual internal weights to jointly attend to its two (new) neighbors' W_i's roughly equally?
Does the cosine similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b)?
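If you have checkpoints saved, what I'd want to plot is roughly the following; treating `checkpoints` as a list of (W, b) snapshots is my assumption, not your actual format:

```python
import numpy as np

def transition_stats(checkpoints, i=3, neighbors=(2, 4)):
    # checkpoints: assumed list of (W, b) snapshots over training, W of
    # shape (n_features, d_hidden). Track the feature gaining magnitude
    # (W_3) against its two new neighbors, plus its bias b_3.
    rows = []
    for W, b in checkpoints:
        w = W[i]
        norm = np.linalg.norm(w)
        cos = [float(w @ W[j]) / (norm * np.linalg.norm(W[j]) + 1e-12)
               for j in neighbors]
        dot = [float(w @ W[j]) for j in neighbors]
        rows.append({"norm_W3": norm, "cos": cos, "dot": dot, "b3": b[i]})
    return rows  # plot each field vs. checkpoint index to see what moves first
```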
Question about the gif: to me it looks like the phase transition is more like
4++- -> unstable 5+- -> 4+- -> 5-
(The unstable 5+- seems to have similar loss to 4+-.)
Why do we not count the large red bar as a "-" ?
Do you expect similar results (besides the fact that it would take longer to train / cost more) without using LoRA?
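(For readers: what LoRA restricts is the rank of the weight update. A generic sketch of the idea, not the post's training code:)

```python
import torch

class LoRALinear(torch.nn.Module):
    # Generic LoRA layer: freeze the base weights and learn a low-rank
    # update (B @ A); full fine-tuning is the full-rank limit of this.
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```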
Hi Sonia - can you please explain what you mean by "mixed selectivity"? In particular, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.