Summary by OpenAI: We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.
Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models
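For context on what the pipeline actually does: the explainer model is shown text excerpts annotated with a neuron's activations and asked for a short natural-language explanation; a simulator then predicts activations from that explanation alone, and the explanation is scored by how well the simulated activations track the real ones. Here is a minimal sketch of that loop, with `call_llm` and `simulate` as placeholder callables rather than OpenAI's actual API, and with prompt wording and scoring details that differ from their real pipeline:

```python
# Sketch of the explain-and-score loop described in the paper.
# call_llm() and simulate() are placeholders, not OpenAI's actual interfaces.
from typing import Callable, List, Tuple
import statistics


def explain_neuron(
    examples: List[Tuple[List[str], List[float]]],
    call_llm: Callable[[str], str],
) -> str:
    """Ask a strong model (GPT-4 in the paper) for a one-line explanation
    of a neuron, given text excerpts annotated with its activations."""
    shown = "\n\n".join(
        " ".join(f"{tok} ({act:.1f})" for tok, act in zip(tokens, acts))
        for tokens, acts in examples
    )
    prompt = (
        "Here are text excerpts with a neuron's activation after each token:\n"
        f"{shown}\n\n"
        "In one sentence, what pattern does this neuron respond to?"
    )
    return call_llm(prompt)


def score_explanation(
    explanation: str,
    examples: List[Tuple[List[str], List[float]]],
    simulate: Callable[[str, List[str]], List[float]],
) -> float:
    """Score an explanation by how well activations simulated from the
    explanation alone correlate with the neuron's real activations."""
    real, simulated = [], []
    for tokens, acts in examples:
        real.extend(acts)
        simulated.extend(simulate(explanation, tokens))
    return statistics.correlation(real, simulated)  # Pearson r, Python 3.10+
```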
Please share your thoughts in the comments!
Inasmuch as this shtick works at all, that doesn't necessarily seem true to me? You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don't see any obvious reason why that threshold would move up with the size of the model under interpretation. The number of neurons/circuits to be interpreted will increase, but why would the complexity of any single interpretation? At the very least, that's a non-trivial claim in need of support.
I don't think that's particularly risky at all. A model that wasn't dangerous before you fed it data about some other model (or, indeed, about itself) isn't going to become dangerous once it understands that data. Conversely, a model that is dangerous after you let it do science has been dangerous from the get-go.
We probably shouldn't have trained GPT-4 to begin with; but given that we have, and didn't die, the least we can do is fully utilize the resultant tool.