Summary by OpenAI: We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.
Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models
Please share your thoughts in the comments!
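For context on what the scores mean: the paper scores an explanation by asking a model to simulate the neuron's per-token activations from the explanation alone, then comparing the simulation against the real activations. Below is a minimal sketch of that scoring loop; the prompt wording and the `ask_model` helper are hypothetical stand-ins for the actual pipeline, which I haven't reproduced here.

```python
import numpy as np

def score_explanation(explanation, tokens, real_activations, ask_model):
    """Score an explanation by how well a model can simulate the neuron.

    `ask_model(prompt) -> list[float]` is a hypothetical helper that asks
    an explainer model (e.g. GPT-4) to predict, per token, how strongly a
    neuron matching `explanation` would fire (the paper uses a 0-10 scale).
    """
    prompt = (
        f"A neuron activates on: {explanation}\n"
        f"For each token below, predict its activation from 0 to 10.\n"
        f"Tokens: {tokens}"
    )
    simulated = np.array(ask_model(prompt), dtype=float)
    real = np.array(real_activations, dtype=float)
    # The headline score is (a variant of) the correlation between
    # simulated and true activations; 1.0 would be a perfect explanation.
    if simulated.std() == 0 or real.std() == 0:
        return 0.0
    return float(np.corrcoef(simulated, real)[0, 1])
```

So a score above 0.5 roughly means the explanation lets a model predict the neuron's behavior with better-than-half correlation.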
This seems like a pretty promising approach to interpretability, and I think GPT-6 will probably be able to explain every neuron in itself with scores above 0.5, which seems like recursive self-improvement territory. It would be nice if, by the time we got there, we already mostly knew how GPT-2, 3, 4, and 5 worked. Understanding previous-generation LLMs is likely to be integral to aligning the next generation, and it's pretty clear that we're not going to stop development, so having some idea of what we're doing is better than none. Even if an AI moratorium were put in place, it would make sense to use GPT-4 to automate some of the neuron research going on right now. What we can hope for is that we do as much of this work as possible with GPT-4 before we jump to GPT-5 and beyond.
This seems to assume the task (writing explanations that score above 0.5 for every neuron) is possible at all, which is doubtful. Superposition and polysemanticity really do happen: a neuron that encodes several unrelated features may simply have no short natural-language explanation that simulates it well. The toy sketch below illustrates this.
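Here is a toy illustration of the problem, loosely in the spirit of the toy-models-of-superposition work; the threshold and dimensions are arbitrary choices of mine, not from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 64, 16  # more sparse features than neurons

# Assign each feature a random unit direction in neuron space, so
# features are forced to share neurons ("superposition").
W = rng.standard_normal((n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

neuron = 0
# Features whose direction puts substantial weight on this one neuron:
overlap = np.abs(W[:, neuron])
print("features this neuron responds to:", np.flatnonzero(overlap > 0.4))
# Typically several unrelated features drive the same neuron, so no single
# short explanation of "what this neuron does" can score highly.
```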