This is perhaps the best interpretability work I've seen outside of Chris Olah's team.

Paper link: https://arxiv.org/abs/2407.20311

(I have neither watched the video nor read the paper yet; I'm posting the link in case someone else was looking for the non-video version.)