Hi habryka,
Thank you for your comment. It contains a few assumptions that are not quite true. I am not sure that the comment section here is the best place to address them, and in person diplomacy may be wise. I would be down to get coffee the next time we are in the same city and discuss in more detail.
Apologies, the post is still getting approved by the EA forum as I've never posted there under this account.
Good note, thank you.
Thanks for your comment. Some follow-up thoughts, especially regarding your second point:
There is sometimes an implicit zeitgeist in the mech interp community that other modalities will simply be an extension or subcase of language.
I want to flip the frame, and consider the case where other modalities may actually be a more general case for mech interp than language. As a loose analogy, the relationship between language mech interp and multimodal mech interp may be like the relationship between algebra and abstract algebra. I have two points here.
Ali...
Thank you for this. How would you think about the pros/cons of influence functions vs activation patching or direct logit attribution in terms of localizing a behavior in the model?
Right now, there's a lot to exploit with CLIP and ViTs so that will be the focus for awhile. We may expand to Flamingo or other models if there is demand.
Other modalities would be fascinating. I imagine they have their own idiosyncrasies. I would be interested in audio in the future but not at the expense of first exploiting vision.
Ideally, yes; a unified interp framework for any modality is the north star. I do think this will be a community effort. Research in language built off findings from many different groups and institutions. Vision and other modalities are currently just not in the same place.
It was surprising to me too. It is possible that the layers do not have aligned basis vectors. That's why corroborating the results with a TunedLens is a smart next step, as they currently may be misleading.
Noted, and thank you for flagging. I mostly agree, and do not have much to add (as we seem mostly in agreement that diverse, bluesky research is good), other than this may shape the way I present this project going forward.
Thank you for this write-up!
I am wondering how to relate causal scrubbing to @Arthur Conmy's ACDC method.
It seems that causal scrubbing becomes relevant when one is testing a relatively specific hypothesis (e.g. induction heads), while ACDC can work with simply a dataset, metric, behavior? If so, would it be accurate to say that ACDC would be a more general pass, and part of an earlier workflow, to develop your hypothesis? And causal scrubbing can validate it? Curious about trade-offs in types of insight, resources, computational complexity, positioning in one's mech interp workflow, and in what circumstances one would use each.
Ok, thank you for your openness. I find that in-person conversations about sensitive matters like these are easier as tone, facial expression, body language are very important here. It is possible that my past comments on EA that you refer to came off as more hostile than intended due to the text-based medium.
Fwiw, the contents of this original post actually have nothing to do with EA itself, or the past articles that mentioned me.