I believe there are two phenomena happening during training
I hypothesize that
For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
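A check like this can be sketched in a few lines. This is only a minimal sketch: the parametrization in `mess3_matrices` (parameters `x` and `a`, one 3x3 matrix per emitted token) is an assumption about how mess3 is commonly written down, not something stated in this comment, and the two parameter pairs at the bottom are illustrative values rather than the ones actually checked. The helper names are hypothetical.

```python
import numpy as np

def mess3_matrices(x, a):
    """Token-labeled transition matrices T[token, from_state, to_state].

    ASSUMPTION: this is one common way to parametrize mess3; the exact
    form used in the discussion above is not given.
    """
    b = (1 - a) / 2
    y = 1 - 2 * x
    ay, by, ax, bx = a * y, b * y, a * x, b * x
    return np.array([
        [[ay, bx, bx], [ax, by, bx], [ax, bx, by]],
        [[by, ax, bx], [bx, ay, bx], [bx, ax, by]],
        [[by, bx, ax], [bx, by, ax], [bx, bx, ay]],
    ])

def stationary_distribution(T):
    """Stationary distribution over hidden states of the HMM."""
    # Summing over emitted tokens gives the state-to-state transition
    # matrix P, whose rows sum to 1.
    P = T.sum(axis=0)
    # The stationary distribution is the left eigenvector of P with
    # eigenvalue 1, i.e. an eigenvector of P transposed.
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()  # normalize (also fixes an overall sign flip)

# Illustrative parameter pairs (assumed, not taken from the comment).
for x, a in [(0.05, 0.85), (0.15, 0.6)]:
    print(x, a, stationary_distribution(mess3_matrices(x, a)))
```

Under this assumed parametrization, the summed matrix `P` has `1 - 2x` on the diagonal and `x` everywhere else, so it is symmetric and doubly stochastic. That would make the uniform stationary distribution hold for any valid `x` and `a`, not just the pairs tried here.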
The terms activation patching, causal tracing, and resample ablation seem to be out of date compared to how you define them in your post on attribution patching.
I'm not familiar with this interpretation. Here's what Claude has to say (correct about stable regions, possibly hallucinating about Hopfield networks):