All of Sonia Joseph's Comments + Replies

Ok, thank you for your openness. I find that in-person conversations about sensitive matters like these are easier, since tone, facial expression, and body language matter a lot here. It is possible that my past comments on EA that you refer to came across as more hostile than intended because of the text-based medium.

Fwiw, the contents of this original post actually have nothing to do with EA itself, or the past articles that mentioned me.

habryka
Makes sense. My experience has been that in-person conversations are helpful for getting on the same page, but they also often come with confidentiality requests that then make it very hard for information to propagate back out into the broader social fabric, and that often makes those conversations more costly than beneficial. But I do think it's a good starting point if you don't do the very costly confidentiality stuff.

Yep, that makes sense. I wasn't trying to imply that it was (but still seems good to clarify).

Hi habryka,

Thank you for your comment. It contains a few assumptions that are not quite true. I am not sure that the comment section here is the best place to address them, and in-person diplomacy may be wise. I would be down to get coffee the next time we are in the same city and discuss in more detail.

habryka
Sure, happy to chat sometime.  I haven't looked into the things I mentioned in a ton of detail (though have spent a few hours on it), but have learned to err on the side of sharing my takes here (where even if they are wrong, it seems better to have them be in the open so that people correct them and people can track what I believe even if they think it's dumb/wrong).

Apologies, the post is still getting approved by the EA forum as I've never posted there under this account.

Thanks for your comment. Some follow-up thoughts, especially regarding your second point:

There is sometimes an implicit zeitgeist in the mech interp community that other modalities will simply be an extension or subcase of language. 

I want to flip the frame, and consider the case where other modalities may actually be a more general case for mech interp than language. As a loose analogy, the relationship between language mech interp and multimodal mech interp may be like the relationship between algebra and abstract algebra. I have two points here.

Ali...

Thank you for this. How would you think about the pros/cons of influence functions vs activation patching or direct logit attribution in terms of localizing a behavior in the model? 
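For readers less familiar with two of the methods named in the question, here is a minimal sketch contrasting direct logit attribution and activation patching on a hypothetical toy model (random weights, no LayerNorm, two components `A` and `B` where `B` reads `A`'s output). It is only an illustration under those assumptions, not a description of any particular library or of how these methods behave on CLIP or ViTs.

```python
import numpy as np

# Hypothetical toy model: embedding -> component A -> component B, all writing to a
# residual stream that is read out by an unembedding (no LayerNorm, for simplicity).
rng = np.random.default_rng(0)
d_model, n_vocab = 8, 5
W_U = rng.normal(size=(d_model, n_vocab))  # unembedding
W_A = rng.normal(size=(d_model, d_model))  # A reads the embedding
W_B = rng.normal(size=(d_model, d_model))  # B reads A's output, giving A an indirect path


def run(x, patch=None):
    """Return each component's output and the final logits; `patch` overrides outputs."""
    patch = patch or {}
    a = patch["A"] if "A" in patch else np.tanh(W_A @ x)
    b = patch["B"] if "B" in patch else np.tanh(W_B @ a)
    outputs = {"embed": x, "A": a, "B": b}
    resid = x + a + b  # residual stream = sum of everything written to it
    return outputs, resid @ W_U


def direct_logit_attribution(outputs, target):
    """Project each component's output onto the target token's unembedding column."""
    return {name: float(out @ W_U[:, target]) for name, out in outputs.items()}


def activation_patching(x_clean, x_corrupt, target):
    """Drop in the target logit when one component's output is swapped for its corrupted-run output."""
    _, clean_logits = run(x_clean)
    corrupt_outputs, _ = run(x_corrupt)
    effects = {}
    for name in ("A", "B"):
        _, patched_logits = run(x_clean, patch={name: corrupt_outputs[name]})
        effects[name] = float(clean_logits[target] - patched_logits[target])
    return effects


x_clean, x_corrupt = rng.normal(size=d_model), rng.normal(size=d_model)
print(direct_logit_attribution(run(x_clean)[0], target=2))
print(activation_patching(x_clean, x_corrupt, target=2))
```

In this toy, the direct attributions sum exactly to the target logit, and because `B` only writes directly to the logits, its patching effect equals the difference of its clean and corrupted direct attributions. `A`'s patching effect additionally includes its path through `B`, which direct logit attribution does not see.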

Right now, there's a lot to exploit with CLIP and ViTs, so that will be the focus for a while. We may expand to Flamingo or other models if there is demand.

Other modalities would be fascinating. I imagine they have their own idiosyncrasies. I would be interested in audio in the future but not at the expense of first exploiting vision. 

Ideally, yes; a unified interp framework for any modality is the north star. I do think this will be a community effort. Research in language built on findings from many different groups and institutions. Vision and other modalities are currently just not in the same place.

It was surprising to me too. It is possible that the layers do not have aligned basis vectors. That's why corroborating the results with a TunedLens is a smart next step, as the current results may be misleading.
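A toy illustration of the basis-alignment worry, under deliberately artificial assumptions: if an intermediate layer encodes the same information as the final layer but in a rotated basis, the logit lens (decoding the intermediate residual with the final unembedding directly) fails, while a learned affine translator in the spirit of the Tuned Lens recovers the final predictions. All sizes and names are hypothetical, and fitting the translator by least squares on hidden states is a simplification; the actual Tuned Lens trains each translator to match the model's final output distribution.

```python
import numpy as np

# Hypothetical toy setup: intermediate residuals carry the same information as the
# final-layer residuals, but expressed in a rotated (misaligned) basis.
rng = np.random.default_rng(0)
n, d_model, n_vocab = 500, 16, 10
W_U = rng.normal(size=(d_model, n_vocab))                   # toy unembedding
h_final = rng.normal(size=(n, d_model))                     # final-layer residuals
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))    # random orthogonal matrix
h_mid = h_final @ Q + 0.05 * rng.normal(size=(n, d_model))  # rotated + small noise

final_pred = (h_final @ W_U).argmax(axis=1)

# Logit lens: decode the intermediate residual with the final unembedding as-is.
logit_lens_pred = (h_mid @ W_U).argmax(axis=1)

# Tuned-lens-style translator: an affine map from the intermediate residual to the
# final residual, fit here by ordinary least squares (a stand-in for the real objective).
X = np.hstack([h_mid, np.ones((n, 1))])                     # append a bias column
A, *_ = np.linalg.lstsq(X, h_final, rcond=None)
tuned_pred = (X @ A @ W_U).argmax(axis=1)

logit_lens_agreement = float((logit_lens_pred == final_pred).mean())
tuned_agreement = float((tuned_pred == final_pred).mean())
print(f"logit lens agrees with final layer: {logit_lens_agreement:.2f}")
print(f"tuned lens agrees with final layer: {tuned_agreement:.2f}")
```

Real residual streams are not a clean rotation of the final layer, so this only shows why misaligned bases can make logit-lens readouts misleading, not how large the effect is in any particular model.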

Noted, and thank you for flagging. I mostly agree, and do not have much to add (as we seem mostly in agreement that diverse, blue-sky research is good), other than that this may shape the way I present this project going forward.

Thank you for this write-up! 

I am wondering how to relate causal scrubbing to @Arthur Conmy's ACDC method. 

It seems that causal scrubbing becomes relevant when one is testing a relatively specific hypothesis (e.g. induction heads), while ACDC can work with just a dataset, a metric, and a behavior? If so, would it be accurate to say that ACDC is a more general first pass, part of an earlier stage of the workflow, for developing your hypothesis, and that causal scrubbing can then validate it? I'm curious about the trade-offs in types of insight, resources, computational complexity, positioning in one's mech interp workflow, and in what circumstances one would use each.

Adrià Garriga-alonso
If you're still interested in this, we have now added Appendix N to the paper, which explains our final take.
Arthur Conmy
(Personal opinion as an ACDC author) I would agree that causal scrubbing aims mostly to test hypotheses and ACDC does a general pass over model components to hopefully develop a hypothesis (see also the causal scrubbing post's related work and ACDC paper section 2). These generally feel like different steps in a project to me (though some posts in the causal scrubbing sequence show that the method can generate hypotheses, and also ACDC was inspired by causal scrubbing). On the practical side, the considerations in How to Think About Activation Patching may be a helpful field guide for projects where one of several tools might be best for a given use case. I think both causal scrubbing and ACDC are currently somewhat inefficient for different reasons: causal scrubbing is slow because causal diagrams have lots of paths in them, and ACDC is slow because computational graphs have lots of edges in them (follow-up work to ACDC will do gradient descent over the edges).
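To make the edge-count point concrete, here is a minimal sketch of ACDC-style greedy edge pruning on a hypothetical five-node graph: receivers are visited in reverse topological order, each incoming edge is tentatively replaced by the sender's activation from a corrupted run, and the edge is pruned if the metric worsens by less than a threshold `tau`. Every candidate edge costs one forward pass, which is why runtime grows with the number of edges. Assumptions: the graph, weights, inputs, and `tau` are made up, and the metric is the absolute change in a scalar output, a simplification of the metrics ACDC uses (e.g. KL divergence).

```python
import math

# Hypothetical toy DAG: nodes in topological order; edges map (sender, receiver) -> weight.
NODES = ["in", "a", "b", "c", "out"]
EDGES = {("in", "a"): 2.0, ("in", "b"): 0.01, ("a", "c"): 1.5,
         ("b", "c"): 0.02, ("a", "out"): 1.0, ("c", "out"): 1.0}


def forward(x, removed=frozenset(), corrupt_vals=None):
    """Evaluate the DAG; edges in `removed` carry the sender's corrupted-run activation."""
    vals = {"in": x}
    for node in NODES[1:]:
        total = 0.0
        for (u, v), w in EDGES.items():
            if v != node:
                continue
            src = corrupt_vals[u] if (u, v) in removed else vals[u]
            total += w * src
        vals[node] = total if node == "out" else math.tanh(total)
    return vals


def acdc(x_clean, x_corrupt, tau):
    """Greedy ACDC-style pruning (simplified sketch); returns the surviving edges."""
    corrupt_vals = forward(x_corrupt)
    target = forward(x_clean)["out"]

    def metric(removed):
        # Distance between the pruned circuit's output and the full model's clean output.
        return abs(forward(x_clean, removed, corrupt_vals)["out"] - target)

    removed = set()
    current = metric(removed)
    # Visit receivers from the output backwards (reverse topological order).
    for node in reversed(NODES):
        for edge in [e for e in EDGES if e[1] == node]:
            candidate = metric(removed | {edge})  # one forward pass per edge tried
            if candidate - current < tau:         # edge barely matters: prune it
                removed.add(edge)
                current = candidate
    return [e for e in EDGES if e not in removed]


circuit = acdc(x_clean=1.0, x_corrupt=-1.0, tau=0.05)
print(circuit)  # [('in', 'a'), ('a', 'c'), ('a', 'out'), ('c', 'out')]
```

With these made-up weights, the weak `in -> b -> c` path is pruned and the circuit through `a` and `c` survives; raising or lowering `tau` trades circuit size against faithfulness to the full model.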