cmathw

Thank you for the comment! Yep, that is correct. I think variants of this approach could still be useful for resolving other forms of superposition within a single attention layer, but not, as currently formulated, across different layers.

cmathw

Thank you for the catch, that is correct: it should be [0, 1]. This was a relic, which I missed, of an older alternative where we used a modified tanh function to bound values to [0, 1). I'll update the above accordingly!
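For concreteness, here is a minimal sketch of one tanh modification that lands in [0, 1). This is illustrative only, not necessarily the exact function from that older version:

```python
import torch

def bounded_tanh(x: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch: negative inputs clamp to 0 via the ReLU,
    # positive inputs approach (but never reach) 1 via the tanh,
    # so the output range is [0, 1).
    return torch.tanh(torch.relu(x))
```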

cmathw

This is really interesting work, and it is presented in a way that makes it easy for others to apply these methods to other tasks. A couple of quick questions:

  1. In this work, you take a clean run and overwrite a specific activation with the corresponding activation from a corrupt run. If you had done this the other way around (i.e. take a corrupt run and see which clean-run activations nudge the model closer to the correct answer), do you think you would find similar results? Should there be a preference as to whether one patches clean --> corrupt or corrupt --> clean? (A rough sketch of the two directions I have in mind is below, after these questions.)
  2. Did you find that the corrupt dataset used for patching had a noticeable effect on which heads appeared most relevant? Concretely, in the 'random answer' corrupt prompt (i.e. replacing the correct answer C_def in the definition with a random word), did the selection of that word matter (e.g. would a word that commonly appears in a function definition work better than other random words from the model's vocabulary), or were results fairly consistent regardless?
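For question 1, a rough sketch of the two patching directions I have in mind, assuming a TransformerLens-style setup. The model, hook site, and prompts here are placeholders, not the ones from the post:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Placeholder clean/corrupt prompt pair; patching assumes aligned token positions.
clean_tokens = model.to_tokens("def add(a, b): return a + b. add is")
corrupt_tokens = model.to_tokens("def add(a, b): return a - b. add is")
assert clean_tokens.shape == corrupt_tokens.shape

_, clean_cache = model.run_with_cache(clean_tokens)
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

# Arbitrary example site: per-head outputs (hook_z) at layer 5.
hook_name = utils.get_act_name("z", 5)

def patch_from(cache):
    # Overwrite this site's activation wholesale with the cached one.
    def hook(activation, hook):
        return cache[hook.name]
    return hook

# Direction 1 (noising): corrupt activation into a clean run --
# which sites *break* the correct behaviour when corrupted?
noised_logits = model.run_with_hooks(
    clean_tokens, fwd_hooks=[(hook_name, patch_from(corrupt_cache))]
)

# Direction 2 (denoising): clean activation into a corrupt run --
# which sites *restore* the correct behaviour on their own?
denoised_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_from(clean_cache))]
)
```

My intuition is that the two directions can highlight different components (a site sufficient to restore behaviour need not be one whose corruption destroys it), which is why I'm curious whether you checked both.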