Once upon a time, the sun let out a powerful beam of light which shattered the world. The air and the liquid were split, turning into body and breath. Body and breath became fire, trees and animals. In the presence of the lightray, any attempt to reunite simply created more shards: mushrooms, carnivores, herbivores and humans. The hunter, the pastoralist, the farmer and the bandit. The king, the blacksmith, the merchant, the butcher. Money, lords, bureaucrats, knights, and scholars. As the sun cleaved through the world, history progressed, creating endless forms most beautiful.
It would be perverse to try to understand a king in terms of his molecular configuration, rather than in terms of the contact between the farmer and the bandit. The molecules of the king are highly diminished phenomena, and whatever information they carry about his place in the ecology is spread thinly across all the molecules and easily lost by missing even a small fraction of them. Anything can only be understood in terms of the greater forms that were shattered from the world, and this includes neural networks too.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed. In principle it's more relevant to study how external objects like the dataset influence the network, though this is complicated by the fact that the datasets themselves are a mishmash of all sorts of random trash[1].
Probably the most relevant approach for current LLMs is Janus's, which focuses on how the different styles of "alignment" performed by the companies affect the AIs, qualitatively speaking. Alternatively, when one has scaffolding that couples important real-world shards to the interchangeable LLMs, one can study how the different LLMs channel the shards in different ways.
Admittedly, it's very plausible that future AIs will use some architectures that bias the representations to be more concentrated in their dimensions, both to improve interpretability and to improve agency. And maybe mechanistic interpretability will work better for such AIs. But we're not there yet.
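For concreteness, one generic (and hypothetical, not named in the post) way to push representations toward concentration is an L1 penalty on hidden activations, which encourages each input to occupy only a few dimensions. A minimal sketch, assuming a toy PyTorch MLP and an arbitrary penalty coefficient:

```python
# Sketch only: an L1 penalty on hidden activations is one generic way to bias
# representations toward using few dimensions per input. The model, sizes, and
# penalty coefficient are all illustrative, not taken from the post.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMLP(nn.Module):
    def __init__(self, d_in=32, d_hidden=256, d_out=10):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = F.relu(self.enc(x))      # the representation we want concentrated
        return self.dec(h), h

model = SparseMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_coeff = 1e-3                      # strength of the sparsity pressure (arbitrary)

def train_step(x, y):
    logits, h = model(x)
    loss = F.cross_entropy(logits, y) + l1_coeff * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

train_step(torch.randn(64, 32), torch.randint(0, 10, (64,)))
```

Whether anything like this actually makes the learned shard-imprints land in identifiable dimensions is exactly the open question; the sketch only shows the kind of architectural pressure being gestured at.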
[1] Possibly clustering the data points by their network gradients would be a way to put some order into this mess? But two problems: 1) The data points themselves are merely diminished fragments of the bigger picture, so the clustering will not be properly faithful to the shard structure. 2) The gradients are as big as the network's weights, so this clustering would be epically expensive to compute.
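As a rough illustration of what the footnote's clustering could look like for a small classifier: flatten each example's gradient, randomly project it down to a manageable dimension (the projection is an added assumption to soften problem 2, not part of the original suggestion), and run k-means on the result.

```python
# Rough sketch of the footnote's idea for a small classifier. The random projection
# is an addition to soften problem 2); for a real LLM even the projection matrix
# used here would be far too large, so treat this strictly as a toy.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def per_example_gradient(model, x, y):
    """Flattened gradient of the loss on one example w.r.t. all weights."""
    model.zero_grad()
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()
    return torch.cat([
        (p.grad if p.grad is not None else torch.zeros_like(p)).flatten()
        for p in model.parameters()
    ])

def cluster_by_gradients(model, xs, ys, n_clusters=5, proj_dim=256, seed=0):
    n_params = sum(p.numel() for p in model.parameters())
    gen = torch.Generator().manual_seed(seed)
    # Gaussian random projection: roughly distance-preserving, far fewer dimensions.
    proj = torch.randn(n_params, proj_dim, generator=gen) / proj_dim ** 0.5
    feats = torch.stack([per_example_gradient(model, x, y) @ proj
                         for x, y in zip(xs, ys)])
    return KMeans(n_clusters=n_clusters).fit_predict(feats.numpy())
```

Even with the projection, nothing here addresses problem 1): the clusters are still clusters of diminished fragments, not of shards.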
What does 'one spot' mean here?
If you just mean 'a particular entry or set of entries of the weight vector in the standard basis the network is initialised in', then sure, I agree.
But that just means you have to figure out a different representation of the weights, one that carves the logic flow of the algorithm the network learned at its joints. Such a representation may not have much reason to line up well with any particular neurons, layers, attention heads or any other elements we use to talk about the architecture of the network. That doesn't mean it doesn't exist.
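As a toy illustration (my own, not part of the comment) of why the standard basis carries no intrinsic meaning: permuting the hidden units of an MLP, together with the matching permutation of the next layer's weights, yields a network whose weight entries sit in different "spots" yet computes exactly the same function.

```python
# Toy illustration: the neuron basis of a hidden layer is not privileged. Permuting
# hidden units, together with the matching permutation of the next layer's weights,
# computes exactly the same function with different entries in every weight "spot".
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2

perm = rng.permutation(d_hidden)
W1p, b1p = W1[perm], b1[perm]          # shuffle the hidden neurons
W2p = W2[:, perm]                      # read them back out in the shuffled order

x = rng.normal(size=d_in)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2))
```

This only exhibits the permutation symmetry; the claim above is stronger, since the representation that carves the learned algorithm at its joints may not line up with any neurons at all. But it already shows that 'a particular entry of the weight vector' is not by itself a meaningful unit.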
Nontrivial algorithms of LLMs require scaffolding and so aren't really concentrated within the network's internal computation flow. Even something as simple as generating text requires repeatedly feeding sampled tokens back to the network, which means that the network has an extra connection from outputs to input that is rarely modelled by mechanistic interpretability.
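A minimal sketch of that loop, with `next_token_logits` standing in for whatever network and scaffolding are actually in play:

```python
# Minimal sketch of the output-to-input loop described above. `next_token_logits`
# is a stand-in for the network; the point is the feedback arrow: each sampled token
# is appended to the context and fed back in, a loop living outside the forward pass.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EOS = 100, 0

def next_token_logits(tokens: list[int]) -> np.ndarray:
    """Stand-in for a real LLM forward pass over the current context."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt: list[int], max_new_tokens: int = 20) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = int(rng.choice(VOCAB_SIZE, p=probs))  # sample from the network's output
        tokens.append(token)                          # ...and feed it back as input
        if token == EOS:
            break
    return tokens

print(generate([42, 17]))
```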