User Comment Replies

The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (' an') would be the top answer for the pre-patched prompt (the one with 'apple'), and the other token (' a') would be the top answer for the patched prompt (the one with 'lemon'). The idea with these prompts is that we know the top prediction is either ' an' or ' a', and we can measure the effect of each individual part of the model by seeing how much patching that part sways the prediction towards ...

1 · scasper · 2y

Thanks, but I'm asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study "this" versus "that" completions or any number of other simple things in the language model?

We Found An Neuron in GPT-2

Clement Neo · 2y · 61

We chose the dot product over cosine similarity because the dot product is the neuron’s effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
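To make the distinction concrete, here is a minimal sketch with made-up toy vectors (not real GPT-2 weights). The names `unembed_an`, `w_out_small`, and `w_out_large` are hypothetical: the first stands in for the embedding-matrix row of the ' an' token, and the other two for the output weights of two neurons that point in the same direction at different scales. Cosine similarity cannot tell the two neurons apart, while the dot product reflects how much each one would move the logit.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product after normalizing both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy vectors, chosen only for illustration.
unembed_an = [1.0, 0.0, 2.0]    # stand-in for the ' an' token's embedding row
w_out_small = [0.5, 0.0, 1.0]   # neuron output weights, small scale
w_out_large = [5.0, 0.0, 10.0]  # same direction, 10x the scale

dot_small = dot(w_out_small, unembed_an)
dot_large = dot(w_out_large, unembed_an)
cos_small = cosine(w_out_small, unembed_an)
cos_large = cosine(w_out_large, unembed_an)

print("dot:   ", dot_small, dot_large)
print("cosine:", cos_small, cos_large)
```

Per unit of activation, the larger-scale neuron contributes ten times as much to the ' an' logit, which is the quantity the dot product tracks and cosine similarity normalizes away.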

I think your point on using the scale of $W_{in}$ if we are concerned about the scale of $W_{out}$ is fair. We didn’t really look at how the rest of the network interacted with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence * average of squared input weights) could give us a better representation of a neuron’s releva...
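One possible reading of the input-scaled score suggested above is sketched below. This is an assumption about how the idea could be written down, not something we implemented: the function name `input_scaled_congruence` and the toy vectors are hypothetical, and "output congruence" is taken to mean the dot product of the neuron's output weights with the token's embedding row.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def input_scaled_congruence(w_in, w_out, unembed):
    """Output congruence multiplied by the mean of the squared input weights.

    Hypothetical formulation of the score floated in the comment.
    """
    output_congruence = dot(w_out, unembed)
    mean_sq_in = sum(x * x for x in w_in) / len(w_in)
    return output_congruence * mean_sq_in

# Hypothetical toy vectors, chosen only for illustration.
w_in = [1.0, 2.0, 2.0, 1.0]   # neuron input weights
w_out = [0.5, 0.0, 1.0]       # neuron output weights
unembed_an = [1.0, 0.0, 2.0]  # stand-in for the ' an' token's embedding row

score = input_scaled_congruence(w_in, w_out, unembed_an)
print(score)
```

Here the output congruence is 2.5 and the mean squared input weight is 2.5, so the combined score is 6.25.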


All of Clement Neo's Comments + Replies