This is perhaps the best interpretability work I've seen outside of Chris Olah's team.

Paper link: https://arxiv.org/abs/2407.20311

(I have neither watched the video nor read the paper yet, just in case someone else was looking for the non-video version)

Could you dig into why you think it's great interp work?