Building Black-box Scheming Monitors
*James, Rich, and Simon are co-first authors on this work. This is a five-week interim report supported by ML Alignment & Theory Scholars.*

**Executive Summary**

* Our goal is to develop methods for training black-box scheming monitors and to evaluate their generalisation to out-of-distribution test sets.
* We aim to emulate the real-world setting, where training data is narrower than, and different from, the data encountered during deployment.
* We train our monitors using chain-of-thought (CoT) and actions, though we remain open to "action-only" monitoring in the future.
* We generate training data using sabotage settings in ControlArena. We collect test data by eliciting egregious reward hacking in SWEBench and by curating public datasets from SHADE Arena and Alignment Faking.
* We use a suite of prompted LLMs to assist with the ground-truth labelling of our data. We are interested in feedback on our labelling procedure.
* Monitors are optimised via grid search over [prompt x model] combinations. In each iteration, we hand-craft new prompts by combining successful components of previous ones.
* Our best-performing monitor achieved 99.3% AUROC in training and 93.9% AUROC on our SWEBench test set (all datasets are evenly split between scheming and honest trajectories).
* Next, we plan to expand our dataset in two directions: (1) adding new sabotage settings in ControlArena to increase diversity, and (2) experimenting with human-generated sabotage trajectories to increase difficulty.
* We also plan to run a series of monitoring experiments on our dataset, including ensembling, hierarchical monitoring, continuous vs. binary scoring, and one-shot/few-shot prompting.
* This post is a call for feedback: What flaws do you see in our methodology? What are we missing? Which directions should we prioritise? Do you work on similar projects and have any advice?

Figure 1. ROC curves for our trained monitor, computed on each of the individual ControlArena settings, on the whole training set (Red), and th
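For concreteness, the monitor-selection loop can be sketched as below. This is a minimal illustrative sketch, not our actual implementation: `run_monitor` is a hypothetical callable standing in for one monitor query (it returns a suspicion score for a single trajectory), and the AUROC helper is the standard rank-based (Mann-Whitney) estimate.

```python
from itertools import product

def auroc(labels, scores):
    """Rank-based AUROC: fraction of (scheming, honest) pairs the
    monitor orders correctly, counting ties as half-correct."""
    pos = [s for l, s in zip(labels, scores) if l == 1]  # scheming
    neg = [s for l, s in zip(labels, scores) if l == 0]  # honest
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best_monitor(prompts, models, trajectories, labels, run_monitor):
    """Grid search over [prompt x model]: score every combination by
    AUROC on the labelled trajectories and keep the best one."""
    best = None
    for prompt, model in product(prompts, models):
        scores = [run_monitor(model, prompt, t) for t in trajectories]
        score = auroc(labels, scores)
        if best is None or score > best[0]:
            best = (score, prompt, model)
    return best
```

In practice each `run_monitor` call is an LLM query, so the grid is evaluated once per iteration and the surviving prompt components seed the hand-crafted prompts for the next round.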