This is perhaps the best interpretability work I've seen outside of Chris Olah's team.
Paper link: https://arxiv.org/abs/2407.20311 (I have neither watched the video nor read the paper yet; posting just in case someone else was looking for the non-video version.)
Could you dig into why you think it's great interpretability work?