Summary by OpenAI: We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.
Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models
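The scoring idea, roughly as the paper describes it: a simulator model predicts the neuron's activations from the explanation alone, and the explanation is scored by how well those simulated activations match the real ones. A minimal sketch of that correlation scoring (the function name and toy arrays are mine, not OpenAI's code):

```python
import numpy as np

def score_explanation(real_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    # An explanation scores well if a simulator, given only the explanation,
    # can predict the neuron's real activations across many tokens.
    # Here the score is the Pearson correlation between the two series.
    return float(np.corrcoef(real_acts, simulated_acts)[0, 1])

# Hypothetical usage: one GPT-2 neuron's activations over a token sequence,
# versus what a simulator predicted from the written explanation.
real = np.array([0.0, 2.1, 0.1, 3.4, 0.0])
simulated = np.array([0.2, 1.8, 0.0, 3.0, 0.1])
print(score_explanation(real, simulated))  # near 1.0 for a good explanation
```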
Please share your thoughts in the comments!
You need a larger model to interpret your model, which means making one model understand another. That does not look safe!
This feels reminiscent of Emerson Pugh's quip: "If the human brain were so simple that we could understand it, we would be so simple that we couldn't."
And while it's a well-constructed pithy quote, I don't think it's true. Can a system understand itself? Can a quining computer program exist? Where is the line between a program being able to recite itself and being able to understand itself?
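For what it's worth, quines do exist. A minimal Python one (not from the post, just an illustration), whose output is character-for-character its own source:

```python
# A quine: prints its own source exactly.
s = '# A quine: prints its own source exactly.\ns = %r\nprint(s %% s)'
print(s % s)
```

It recites itself perfectly, yet in no meaningful sense does it understand itself, which is exactly the distinction the question above is pointing at.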