This post is an announcement for a software library. It is likely only relevant to those working, or looking to start working, in mechanistic interpretability.
What is graphpatch?
graphpatch is a Python library for activation patching on arbitrary PyTorch neural network models. It is designed to minimize the boilerplate needed to run experiments that make causal interventions on a model's intermediate activations. It provides an intuitive API based on the structure of a torch.fx.Graph representation compiled automatically from the original model. As a somewhat silly example, I can make Llama play Taboo by zero-ablating its output logit for the token representing "Paris":
```python
with patchable_llama.patch(
    {"lm_head.output": ZeroPatch(slice=(slice(None), slice(None), 3681))}
):
    print(
        ...
    )
```
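For readers unfamiliar with what a patch like this amounts to mechanically: the equivalent operation in vanilla PyTorch is a forward hook that zeroes one slice of a module's output. Here is a minimal sketch on a toy stand-in for `lm_head` (the layer shapes and `TOKEN_ID` are placeholders, not Llama's real dimensions or the actual "Paris" token index):

```python
import torch
import torch.nn as nn

# Toy stand-in for an LM head; the real example patches Llama's lm_head,
# whose output has shape (batch, seq_len, vocab_size).
lm_head = nn.Linear(8, 10, bias=False)

TOKEN_ID = 3  # hypothetical vocab index to ablate (3681 for "Paris" in the post)

def zero_ablate(module, inputs, output):
    # Zero the logit for one vocabulary entry across the whole batch/sequence,
    # mirroring ZeroPatch(slice=(slice(None), slice(None), TOKEN_ID)).
    output = output.clone()
    output[:, :, TOKEN_ID] = 0.0
    return output  # returning a value from a forward hook replaces the output

handle = lm_head.register_forward_hook(zero_ablate)
logits = lm_head(torch.randn(2, 5, 8))
handle.remove()

assert torch.all(logits[:, :, TOKEN_ID] == 0.0)
```

The point of graphpatch is that you don't have to write hook plumbing like this yourself, and (as discussed below) hooks only reach values that happen to cross a module boundary.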
Thanks! You’re correct that you can implement ROME with vanilla hooks, since these give you access to module inputs in addition to outputs. But the fact that this works is contingent on both the specific interventions ROME makes and the way Llama/GPT-2 happen to be implemented. To get perhaps overly concrete: in the relevant line of Llama's MLP forward pass, ROME wants the result of the multiplication, which isn’t the output of any individual submodule. You happen to be able to access it as the input of down_proj, because that happens to be a module, but it didn’t have to be implemented this way. (This would be even worse if we wanted to patch the value…)
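The MLP line being referenced is presumably Hugging Face's LlamaMLP.forward. A minimal sketch of that structure (simplified, with arbitrary dimensions — not the actual transformers code) shows why the multiplication result is only reachable as down_proj's input:

```python
import torch
import torch.nn as nn

# Sketch mirroring the structure of Hugging Face's LlamaMLP (simplified).
class LlamaStyleMLP(nn.Module):
    def __init__(self, hidden=8, intermediate=16):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # The multiplication result below is not the output of any submodule,
        # but it *is* the input to down_proj -- so a hook can reach it only
        # because down_proj happens to be a module.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

mlp = LlamaStyleMLP()
x = torch.randn(2, 8)

captured = {}
def grab_input(module, inputs, output):
    captured["down_proj_in"] = inputs[0]

handle = mlp.down_proj.register_forward_hook(grab_input)
mlp(x)
handle.remove()

# The captured input equals the intermediate multiplication result.
expected = mlp.act_fn(mlp.gate_proj(x)) * mlp.up_proj(x)
assert torch.allclose(captured["down_proj_in"], expected)
```

Had the implementation fused that line differently (e.g. no separate down_proj module), there would be no module boundary to hook, which is the contingency the comment above is pointing at.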