Question about the "rules of the game" you present: are you allowed to simply look at layer-0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from these features' top activators. From your case study, it seems that you effectively look at layer-0 transcoder features for a few of the final tokens via a backwards search, but I wonder whether you could skip the search and simply look at the transcoder features directly. Thank you.
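To make concrete what I mean, here's a rough sketch (all names are hypothetical; I'm assuming precomputed per-token layer-0 transcoder activations and a feature-to-top-exemplars lookup, neither of which I know the actual interface for):

```python
import numpy as np

# Hypothetical sketch: acts is an (n_tokens, n_features) array of layer-0
# transcoder feature activations for the prompt, and top_activators maps a
# feature index to the dataset strings that most strongly activate it
# (both assumed to be precomputed; the names are made up for illustration).
def guess_input_from_layer0(acts, top_activators, k=10):
    guesses = []
    for pos in range(max(0, acts.shape[0] - k), acts.shape[0]):  # final k tokens
        feat = int(acts[pos].argmax())  # strongest layer-0 feature at this position
        guesses.append(top_activators.get(feat, ["<unknown>"])[0])
    return guesses
```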
This is a casual thought and by no means something I've thought hard about: I'm curious whether b is a lagging indicator, which is to say that the real change is happening in the weights, and once the weights go through this change, b catches up to it.
Another speculative thought: say we are moving from 4* -> 5*, and W_3 is the new W that is taking on high magnitude. Does this occur because W_3 happens to have enough internal individual weights to jointly look at its two (new) neighbors' W_i's roughly equally?
Does the cosine similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b)?
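If it helps clarify the question, here's the kind of diagnostic I have in mind, as a minimal sketch (the checkpoint format, the index of W_3, and the neighbor indices are all assumptions; I'm imagining snapshots of W and b saved across the transition):

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical: checkpoints is a list of (step, W, b) tuples saved across the
# 4* -> 5* transition, where W has shape (n_features, d_hidden) and b is the bias.
def track_transition(checkpoints, new_idx=3, neighbor_idxs=(2, 4)):
    for step, W, b in checkpoints:
        sims = [cos_sim(W[new_idx], W[j]) for j in neighbor_idxs]
        print(f"step {step:6d}  "
              f"|W_3| = {np.linalg.norm(W[new_idx]):.3f}  "
              f"cos(W_3, neighbors) = {[round(s, 3) for s in sims]}  "
              f"|b| = {np.linalg.norm(b):.3f}")
```

Plotting these three quantities against training steps would show whether the alignment with neighbors rises before b moves.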
Hi Sonia - can you please explain what you mean by "mixed selectivity"? In particular, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.