We chose the dot product over cosine similarity because the dot product is the neuron’s direct effect on the logits (unembedding takes the dot product of the residual stream with the embedding matrix).
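To make the distinction concrete, here is a minimal NumPy sketch rather than the code we actually ran. The names `congruence_scores`, `w_out` and `W_U` are made up for illustration, the weights are random, and the final layer norm is ignored. It shows that rescaling a neuron's output weights rescales its dot-product congruence (and its logit effect) but leaves cosine similarity unchanged.

```python
import numpy as np

def congruence_scores(w_out, W_U):
    """Return (dot-product, cosine) congruence of one neuron with every token.

    w_out: (d_model,) output weights of the neuron (hypothetical name).
    W_U:   (d_model, n_vocab) unembedding matrix (hypothetical name).
    """
    dot = w_out @ W_U
    cos = dot / (np.linalg.norm(w_out) * np.linalg.norm(W_U, axis=0))
    return dot, cos

# Random toy weights, purely for illustration.
rng = np.random.default_rng(0)
d_model, n_vocab = 8, 5
W_U = rng.normal(size=(d_model, n_vocab))
w_out = rng.normal(size=d_model)

dot, cos = congruence_scores(w_out, W_U)
dot_scaled, cos_scaled = congruence_scores(3.0 * w_out, W_U)

# Writing activation * w_out into the residual stream shifts the logits by
# activation * dot (assumption: no layer norm before unembedding).
activation = 2.0
resid = rng.normal(size=d_model)
logit_shift = (resid + activation * w_out) @ W_U - resid @ W_U
```

Here `dot_scaled` is three times `dot` while `cos_scaled` equals `cos`, and `logit_shift` equals `activation * dot`, which is why the dot product tracks the size of the logit effect and cosine similarity does not.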
I think your point about taking scale into account is fair. We didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence * average of squared input weights) could give us a better representation of a neuron’s releva...
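As a rough sketch of what such a score could look like, the function below takes the "output congruence * average of squared input weights" example literally. The function name, argument names and toy numbers are all hypothetical, and this is not something we computed.

```python
import numpy as np

def input_scaled_congruence(w_in, w_out, token_unembed):
    """Output congruence multiplied by the mean squared input weight.

    w_in:          (d_model,) input weights of the neuron (hypothetical name).
    w_out:         (d_model,) output weights of the neuron.
    token_unembed: (d_model,) unembedding vector of the token of interest.
    """
    output_congruence = float(w_out @ token_unembed)
    input_scale = float(np.mean(w_in ** 2))
    return output_congruence * input_scale

# Toy numbers chosen so the arithmetic is easy to follow.
w_in = np.array([1.0, -1.0, 1.0, -1.0])      # mean squared weight = 1
w_out = np.array([1.0, 2.0, 0.0, 0.0])
token_unembed = np.array([1.0, 1.0, 0.0, 0.0])

score = input_scaled_congruence(w_in, w_out, token_unembed)               # 3 * 1
score_big_inputs = input_scaled_congruence(2 * w_in, w_out, token_unembed)  # 3 * 4
```

Doubling the input weights quadruples the score while leaving the output congruence untouched, so a neuron that reads strongly from the residual stream ranks higher than one with the same output direction but tiny input weights.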
The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (' an') would be the top answer for the pre-patched prompt (the one with 'apple'), and the other token (' a') would be the top answer for the patched prompt (the one with 'lemon'). The idea with these prompts is that we know the top prediction is either ' an' or ' a', so we can measure the effect of each individual part of the model by seeing how much patching that part of the model sways the prediction towards ...
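The patching measurement can be illustrated with a toy example. This is a sketch under strong assumptions, not our actual setup: the "model" is just a sum of two made-up component outputs (`attn`, `mlp`) followed by a toy unembedding, the vocabulary contains only ' an' and ' a', and the activations are invented. Which run is patched into which is also an assumption here. In a real transformer, patching one component also changes everything downstream of it.

```python
import numpy as np

AN_IDX, A_IDX = 0, 1  # toy vocabulary: index 0 = ' an', index 1 = ' a'

def logit_diff(logits):
    """How strongly the model prefers ' an' over ' a'."""
    return float(logits[AN_IDX] - logits[A_IDX])

def run_toy_model(cache, W_U):
    """Sum the component outputs into a residual stream, then unembed."""
    resid = np.sum(list(cache.values()), axis=0)
    return resid @ W_U

def patch_component(base_cache, source_cache, name):
    """Copy base_cache, replacing one component with the source run's value."""
    patched = dict(base_cache)
    patched[name] = source_cache[name]
    return patched

W_U = np.eye(2)  # toy unembedding
# Invented activations for the 'apple' and 'lemon' prompts.
apple_cache = {"attn": np.array([0.5, 0.5]), "mlp": np.array([2.0, 0.0])}
lemon_cache = {"attn": np.array([0.5, 0.5]), "mlp": np.array([0.0, 2.0])}

base = logit_diff(run_toy_model(apple_cache, W_U))
effects = {
    name: logit_diff(run_toy_model(patch_component(apple_cache, lemon_cache, name), W_U)) - base
    for name in apple_cache
}
```

In this toy, patching `attn` leaves the logit difference unchanged, while patching `mlp` moves it from 2.0 to -2.0, which flags `mlp` as the component that carries the ' an' versus ' a' information.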