All of Jett Janiak's Comments + Replies

I'm not familiar with this interpretation. Here's what Claude has to say (correct about stable regions, maybe hallucinating about Hopfield networks):

This is an interesting question that connects the findings in the paper to broader theories about how transformer models operate. Let me break down my thoughts:

The paper's findings and the Hopfield network interpretation of self-attention are not directly contradictory, but they're not perfectly aligned either. Let's examine this in more detail:

  1. The paper's key findings:
    • The residual stream of trained transformer
...
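For reference, the Hopfield-network interpretation of self-attention is usually traced to Ramsauer et al.'s "Hopfield Networks is All You Need" (2020), which identifies a single attention step with the update rule of a modern continuous Hopfield network,

$$\xi^{\text{new}} = X \,\mathrm{softmax}(\beta X^{\top} \xi),$$

where the stored patterns $X$ play the role of keys/values and the query state $\xi$ is pattern-completed towards a fixed point.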

I believe there are two phenomena happening during training (a rough way to quantify both is sketched below):

  1. Predictions corresponding to the same stable region become more similar, i.e. stable regions become more stable. We can observe this in the animations.
  2. Existing regions split, resulting in more regions.

I hypothesize that

  1. could be some kind of error correction: models learn to rectify errors coming from superposition interference or some other kind of noise.
  2. could be interpreted as more capable models picking up on subtler differences between the prompts and adjusting their predictions accordingly.
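Here is the sketch for quantifying both, under assumptions of my own rather than the paper's methodology: it presumes access to final-layer logits on a fixed prompt set at each checkpoint, uses Jensen-Shannon distance between next-token distributions, and clusters with an arbitrary threshold (`region_stats` is a hypothetical helper name):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

def region_stats(logits, threshold=0.1):
    """For one checkpoint: cluster prompts by how similar their next-token
    predictions are, and report (number of regions, mean within-region distance).

    logits: (n_prompts, vocab_size) final-layer logits on a fixed prompt set.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Pairwise Jensen-Shannon distances between predicted distributions.
    condensed = pdist(probs, metric="jensenshannon")
    labels = fcluster(linkage(condensed, method="average"),
                      t=threshold, criterion="distance")

    # Mean distance between predictions that fall in the same region.
    dist = squareform(condensed)
    within = []
    for region in np.unique(labels):
        idx = np.where(labels == region)[0]
        if len(idx) > 1:
            sub = dist[np.ix_(idx, idx)]
            within.append(sub[np.triu_indices(len(idx), k=1)].mean())

    n_regions = int(labels.max())
    mean_within = float(np.mean(within)) if within else 0.0
    return n_regions, mean_within
```

If 1. is right, the mean within-region distance should shrink across checkpoints; if 2. is right, the number of regions should grow.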
eggsyntax
I endorse Scott's view in that piece. Assuming the AIS research community is generally comfortable with a Bayesian view of probability (which I do assume), I see it as mostly orthogonal to this proposal.

For the two sets of mess3 parameters I checked, the stationary distribution was uniform.
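In case anyone wants to reproduce the check, here's a minimal sketch of how I'd do it: sum the token-labelled transition matrices $T^{(s)}$ to get the state-transition matrix, then take its left eigenvector with eigenvalue 1. The per-symbol matrices below are placeholders, not the actual mess3 ones; substitute the matrices built from the $(x, \alpha)$ parameters you want to check.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of the row-stochastic matrix T with eigenvalue 1,
    normalized to sum to 1."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

# Placeholder token-labelled matrices T^(s) -- substitute the mess3 matrices
# built from the (x, alpha) parameters you want to check.
T_per_symbol = [
    np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]]) / 3.0
    for _ in range(3)
]
T = sum(T_per_symbol)  # marginalize over emitted symbols -> state-transition matrix

pi = stationary_distribution(T)
print(pi, np.allclose(pi, np.ones(3) / 3))  # uniform iff every entry is ~1/3
```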

The activation patching, causal tracing, and resample ablation terms seem to be out of date compared to how you define them in your post on attribution patching.