User Comment Replies

(OLD) An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.

a decent survey paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic field of interpretability

A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify i...

2 Neel Nanda 2y

This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.

[Linkpost] A survey on over 300 works about interpretability in deep networks

infinitevoid 3y 50

Nice paper, thanks! A meta question - how did you analyse and systematise the results of over 300 papers? (gesturing at software tools/general methodology here)

6 scasper 3y

The taxonomy we introduced in the survey gave a helpful way of splitting up the problem. Other than that, it took a lot of effort, several Google Docs that got very messy, and https://www.connectedpapers.com/. Personally, I've also been working on interpretability for a while and have passively formed a mental model of the space.


All of infinitevoid's Comments + Replies