Great work! Love the push for intuitions, especially in the working notes.
My understanding of the superposition hypothesis from the TMS paper has been (feel free to correct me!):
Is it possible that the features here are not sufficiently basis-aligned and are closer to case 1? As you already commented, demonstrating polysemanticity when the hidden layer has a non-linearity and m > n would be principled, imo.
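For intuition, here's a minimal sketch (my own construction, not code from the post) of the m > n regime: five features packed into two hidden dimensions using the pentagon-style arrangement from the TMS paper, where a ReLU plus a negative bias cleans up the interference between non-orthogonal feature directions:

```python
import numpy as np

m, n = 5, 2  # m features, n hidden dimensions, m > n
# Pentagon arrangement: feature directions equally spaced on the unit circle
theta = 2 * np.pi * np.arange(m) / m
W = np.stack([np.cos(theta), np.sin(theta)])  # shape (n, m)
b = -np.cos(2 * np.pi / m)  # bias cancels the largest positive interference

def reconstruct(x):
    """x -> hidden (n dims) -> ReLU readout back to feature space."""
    h = W @ x                           # compress m features into n dims
    return np.maximum(W.T @ h + b, 0.0)  # ReLU denoises the interference

# Each one-hot feature is recovered on its own coordinate alone,
# even though 5 features live in only 2 dimensions.
for i in range(m):
    x = np.zeros(m)
    x[i] = 1.0
    print(i, np.round(reconstruct(x), 3))
```

Without the non-linearity (or with m <= n) the interference terms would leak into the other coordinates, which is why the m > n + ReLU setting is the principled place to look for polysemanticity.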
Thanks for the feedback, this is a great point! I haven't come across evidence in real models that points towards this. My default assumption was that they operate near the upper bound of possible superposition capacity. It would be great to know if they don't, as it affects how we estimate the number of features and, subsequently, the SAE expansion factor.
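To make that dependence concrete (illustrative numbers only, not estimates from any real model): if a layer of width d holds F features in superposition, an SAE that assigns roughly one latent per feature needs an expansion factor of about F/d, so a model operating well below capacity would call for a much smaller dictionary:

```python
d_model = 768  # hypothetical layer width
# near-capacity vs. well-below-capacity feature-count guesses
for n_features in (50_000, 8_000):
    expansion = n_features / d_model
    print(f"{n_features} features -> SAE expansion factor ~{expansion:.1f}")
```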