Is this consistent with the interpretation of self-attention as approximating (large) steps in a Hopfield network?
I have an old hypothesis about this which I might finally get to see tested. The idea is that the feedforward networks of a transformer create little attractor basins. My reasoning is twofold. First, the QK circuit passes only very limited information to the OV circuit about what is present in the other streams, which introduces noise into the residual stream during attention layers; attractor basins would give the network a way to clean up that noise. Second, I suspect another reason is inferring concepts from limited information:
Consider that the prompts "The German physicist with the wacky hair is called" and "General relativity was first laid out by" will both lead to "Albert Einstein". They will likely land in different parts of the same attractor basin and then converge.
You can measure which parts of the network are doing the compression using differential optimization: take d[OUTPUT]/d[INPUT] as normal, and compare it to d[OUTPUT]/d[INPUT] when the activations of part of the network are "frozen". Moving from one region to another, you would see a positive value inside one basin, a large negative value at the border, and then another positive value in the next region.
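A minimal sketch of the freezing comparison, using a toy residual block as a hypothetical stand-in for part of a transformer (the weights, network shape, and finite-difference setup are all illustrative assumptions, not the actual experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy residual-stream width

# Hypothetical stand-in for one transformer sub-block: x + W2 @ tanh(W1 @ x)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def forward(x, frozen_mlp=None):
    """Residual block; optionally freeze the MLP's activations at a fixed value."""
    mlp = np.tanh(W1 @ x) if frozen_mlp is None else frozen_mlp
    return x + W2 @ mlp

def sensitivity(x, dx, frozen_mlp=None, eps=1e-4):
    """Finite-difference estimate of |d[OUTPUT]/d[INPUT]| along direction dx."""
    out_plus = forward(x + eps * dx, frozen_mlp)
    out_minus = forward(x - eps * dx, frozen_mlp)
    return np.linalg.norm(out_plus - out_minus) / (2 * eps)

x = rng.normal(size=d)
dx = rng.normal(size=d)
dx /= np.linalg.norm(dx)

# Compare the full network's sensitivity to the sensitivity with the MLP
# frozen at its unperturbed activations; the gap attributes the
# compression (or expansion) to that part of the network.
full = sensitivity(x, dx)
frozen = sensitivity(x, dx, frozen_mlp=np.tanh(W1 @ x))
print(f"full: {full:.3f}  frozen: {frozen:.3f}  gap: {full - frozen:.3f}")
```

With the MLP frozen, only the residual path responds to the perturbation, so the frozen sensitivity is exactly the norm of the input direction; the gap isolates what the frozen component contributes.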
I believe there are two phenomena happening during training
I hypothesize that
This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and express interest in upcoming iterations here.
This video is a short overview of the project, presented on the final day of LASR Labs. Note that the paper has been updated since then.
We study the effects of perturbing Transformer activations, building upon recent work by Gurnee, Lindsey, and Heimersheim & Mendel. Specifically, we interpolate between model-generated residual stream activations and measure the change in the model output. Our initial results suggest that:
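The interpolation setup can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the activations, the downstream readout, and the 2-token vocabulary are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical stand-ins for residual-stream activations collected
# from two different prompts at the same layer and position.
act_a = rng.normal(size=d_model)
act_b = rng.normal(size=d_model)

def rest_of_model(resid):
    """Toy stand-in for the layers downstream of the perturbed activation."""
    readout = np.linspace(-1, 1, d_model)  # fixed "unembedding" direction
    logits = np.array([readout @ resid, -(readout @ resid)])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax over a 2-token vocabulary

def kl(p, q):
    """KL divergence between two output distributions."""
    return float(np.sum(p * np.log(p / q)))

# Linearly interpolate between the two activations and track how far
# the model's output distribution moves from the alpha = 0 baseline.
baseline = rest_of_model(act_a)
for alpha in np.linspace(0.0, 1.0, 5):
    mixed = (1 - alpha) * act_a + alpha * act_b
    print(f"alpha={alpha:.2f}  KL from baseline = {kl(baseline, rest_of_model(mixed)):.4f}")
```

A stable region would show up here as a plateau: the KL stays near zero across a range of alpha, then jumps sharply at a boundary.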
We believe that studying stable regions can improve our understanding of how neural networks work. The extent to which this understanding is useful for safety is an active topic of discussion [1] [2] [3].
The updated paper, with additional plots in the appendix, is not yet visible on arXiv, but you can read it here.