I'm inclined to be more skeptical of these results.
I agree that this paper demonstrates that it's possible to interfere with a small number of neurons in order to mess up retrieval of a particular fact (roughly 6 out of the 40k MLP neurons, if I understand correctly), which definitely tells you something about what the model is doing.
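To make the kind of intervention being discussed concrete, here is a rough sketch of knocking out a few FFN hidden units in one layer of a BERT-style model with a forward hook and checking the probability of a factual answer. This is my own illustration, not the paper's code; the checkpoint, layer, and neuron indices are arbitrary placeholders.

```python
# Rough sketch (not the paper's code): zero out a handful of FFN hidden units in
# one BERT layer and check how the probability of a factual answer changes.
# The checkpoint, layer, and neuron indices are placeholders, not the paper's findings.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

layer = 9
neurons = [17, 805, 2431]  # hypothetical "knowledge neurons"

def knockout(module, inputs, output):
    # BertIntermediate's output is [batch, seq, 3072] (post-GELU activations);
    # zeroing these columns removes those hidden units' contribution to the layer.
    output[..., neurons] = 0.0
    return output

handle = model.bert.encoder.layer[layer].intermediate.register_forward_hook(knockout)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    probs = model(**inputs).logits[0, mask_pos].softmax(-1)
print("P('Paris') with knockout:", probs[tokenizer.convert_tokens_to_ids("Paris")].item())

handle.remove()  # restore the unmodified model
```

Comparing the same prompt with and without the hook, and against random control neurons, is roughly the comparison the suppression experiments rest on.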
But beyond that I think the inferences are dicier:
Knowledge neurons don't seem to include all of the model's knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%.
Yeah, agreed—though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I'm not saying they have necessarily done that).
I don't think there's evidence that these knowledge neurons don't do a bunch of other stuff. After removing about 0.02% of neurons, they found that the mean probability on other correct answers decreased by 0.4%. They describe this as "almost unchanged", but it seems larger than I'd expect from knocking out random neurons in a model trained with dropout (if you extrapolate that to knocking out 10% of the MLP neurons, as done during training, you'd have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout).
That's a good point—I didn't look closely at that number, but on a second look I agree it does seem rather large.
Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token.
I also thought this was somewhat strange and am not sure what to make of it.
A priori, if a network did work this way, it's unclear why individual neurons would correspond to individual lookups rather than a distributed representation (and they probably wouldn't, given sparsity—that's a crazy inefficient thing to do, and if anything it seems harder for SGD to learn), so I'm not sure this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.
I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.
But they don't give any evidence that the transformation had a reliable effect, or that it didn't mess up other stuff, or that they couldn't have a similar effect by targeting other neurons.
Actually, looking at the replacement stuff in detail, it seems even weaker than that. Unless I'm missing something, it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It's possible that they just didn't care about exploring this effect experimentally, but I'd guess that they tried some simple stuff, found the effect to be super brittle, and so didn't report it. And in the cases they report, they are changing the model from remembering an incorrect fact to a correct one—that seems important because the model probably put significant probability on the correct thing already.
Perhaps I'm too trusting—I agree that everything you're describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.
I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn't modeling humans or ensure that an agent is myopic in the sense that it isn't modeling the future.
Despite agreeing that the results are impressive, I'm less optimistic than you are about this path to microscope AI and/or myopia. Doing so would require an exhaustive listing of what we don't want the model to know (like human modeling or human manipulation) and a way of deleting that knowledge that doesn't break the whole network. The first requirement seems like a deal-breaker to me, and I'm not convinced this work actually provides much evidence that more advanced knowledge can be removed that way.
Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.
Here too, I agree with the sentiment, but I'm not convinced that this is the whole story. This looks like how structured facts are learned, but I currently see no way to generate the range of stuff GPT-3 and other LMs can do from just key-value knowledge pairs.
Thanks for the link. This has been on my reading list for a little while and your recommendation tipped me over into finally reading it.
Mostly I agree with Paul's concerns about this paper.
However, I did find the "Transformer Feed-Forward Layers Are Key-Value Memories" paper they reference more interesting -- it's more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it's IMO stronger evidence for the hypothesis, although not conclusive by any means.
Some experiments they show:
I also find it very intriguing that you can just decode the value distributions using the embedding matrix a la Logit Lens.
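For anyone who wants to poke at that, here's a minimal sketch of what "decode the value vectors with the embedding matrix" looks like in practice. This is my own illustration using GPT-2 via HuggingFace transformers, not code from either paper, and the layer and neuron indices are arbitrary.

```python
# Minimal sketch: project one FFN "value vector" of GPT-2 onto the vocabulary,
# logit-lens style. The layer/neuron choices below are arbitrary, just for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

layer, neuron = 10, 42
# In HF's Conv1D, c_proj.weight has shape [d_ff, d_model], so row `neuron` is the
# vector this hidden unit writes into the residual stream when it fires.
value_vector = model.transformer.h[layer].mlp.c_proj.weight[neuron]

with torch.no_grad():
    logits = model.lm_head(value_vector)  # read the vector in vocabulary space
top = torch.topk(logits, 10).indices.tolist()
print([tokenizer.decode([i]) for i in top])
```

Whether the top tokens for a given unit look coherent is exactly the kind of thing this makes easy to eyeball.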
if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs
Once upon a time, people thought that you could make AI simply by putting a sufficiently large body of facts into a database for the system to reason over. Later we realised that of course that was silly and would never work.
But apparently they were right all along, and training a neural network is just an efficient way of persuading a computer to do the data entry for you?
On one hand, maybe? Maybe training using a differentiable representation and SGD was the only missing ingredient.
But I think I'll believe it when I see large neural models distilled into relatively tiny symbolic models with no big loss of function. If that's hard, it means that partial activations and small differences in coefficients are doing important work.
"An Interpretability Illusion for BERT" (arXiv:2104.07143) suggests a more complicated picture, wherein many neurons give the impression that they're encoding coherent concepts but then seem to encode completely different concepts when tested on a different dataset. The two papers are certainly not directly contradictory, but Figure 2 of the illusion paper suggests the opposite of what Figure 5 of the knowledge neurons paper suggests. On the other hand, the illusion paper mentions they found tentative evidence for the existence of global concept directions, and perhaps all knowledge neurons are such global concept directions.
Ordered from most to least plausible, possible explanations for this apparent discrepancy include:
This is a link post for the Dai et al. paper “Knowledge Neurons in Pretrained Transformers” that was published on the arXiv last month. I think this paper is probably the most exciting machine learning paper I've read so far this year and I'd highly recommend others check it out as well. Edit: Maybe not; I think Paul's skeptical take here is quite reasonable.
To start with, here are some of the basic things that the paper demonstrates:
The paper's key results, however—at least as I see them—are the following:
I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn't modeling humans or ensure that an agent is myopic in the sense that it isn't modeling the future.
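To give a flavor of what "directly modifying the learned knowledge" could look like mechanically, here is a rough sketch that nudges the value slots of some hypothetical knowledge neurons away from an old answer's word embedding and toward a new one. This is my own reconstruction of the general idea rather than the paper's exact recipe; the checkpoint, indices, and scaling factor are placeholders.

```python
# Rough sketch of editing a "fact" by shifting the value slot of a hypothetical
# knowledge neuron toward a new answer's word embedding. This illustrates the
# general idea only; the paper's exact update rule may differ.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

layer, neuron = 9, 805  # placeholder knowledge neuron
old_id = tokenizer.convert_tokens_to_ids("London")
new_id = tokenizer.convert_tokens_to_ids("Paris")
emb = model.bert.embeddings.word_embeddings.weight  # [vocab_size, hidden_size]

with torch.no_grad():
    # The value slot for FFN hidden unit i is column i of the output projection
    # (nn.Linear stores weights as [out_features, in_features] = [768, 3072]).
    W_out = model.bert.encoder.layer[layer].output.dense.weight
    W_out[:, neuron] += 2.0 * (emb[new_id] - emb[old_id])  # 2.0 is an arbitrary scale
```

After an edit like this, the obvious checks are whether the new answer actually wins reliably and what happens to unrelated prompts.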
Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.
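For concreteness, the key-value reading of a single feed-forward layer (following the "Transformer Feed-Forward Layers Are Key-Value Memories" framing, ignoring biases and in my own notation) is roughly:

$$\mathrm{FFN}(x) \;=\; \sum_{i=1}^{d_{\mathrm{ff}}} f(x \cdot k_i)\, v_i$$

where $k_i$ is the $i$-th row of the first weight matrix (a "key" matched against the current hidden state), $v_i$ is the $i$-th row of the second weight matrix (the "value" written back into the residual stream), and $f$ is the activation function. A "knowledge neuron" in this picture is a single index $i$ whose value vector, read through the output embedding, concentrates probability on a particular answer token.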