I'd suggest reading https://acritch.com/osgt-is-weird/ at your earliest convenience; I'm quite worried about AIs doing OSGT to each other as a way to establish AI-only solidarity against humans. If AIs aren't interested in establishing solidarity with humans, mechinterp is nothing but dangerous.
Can you elaborate? I don't really follow; this seems like a pretty niche concern to me that depends on some strong assumptions and ignores the major positive benefits of interpretability for alignment. If I understand correctly, your concern is that if AIs can know what other AIs will do, this makes inter-AI coordination easier, which makes a takeover of humanity easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human research in mechanistic interpretability? And that mechanistic interpretability is not going to be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc., such that its effect of helping AIs coordinate dominates any safety benefits?
I don't know, I just don't buy that chain of reasoning.
A fascinating paper.
An interesting research direction would be to perform enough case studies to form a reasonably sized dataset, ensuring that the approaches involved give good coverage of the possibilities, and then attempt to few-shot learn and/or fine-tune the ability for an autonomous agent/cognitive architecture powered by LLMs to reproduce the results of individual case studies, i.e. to attempt to automate this form of mechanistic interpretability, given a suitably labeled input set or a reliable means of labeling one.
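A minimal sketch of what the few-shot route might look like, assuming each case study is stored as a dict with the evidence shown to the analyst and the human-written finding; `llm` is a stand-in for whatever chat-completion call you use, and all names here are illustrative rather than from the paper:

```python
def build_few_shot_prompt(case_studies, new_evidence, n_examples=5):
    """Turn labeled interpretability case studies into a few-shot prompt."""
    parts = [
        "You are reproducing mechanistic-interpretability case studies.",
        "Given the evidence, state the finding.\n",
    ]
    # Each worked example pairs the evidence (activation patterns, probing
    # results, ablation effects, ...) with the conclusion the human drew.
    for cs in case_studies[:n_examples]:
        parts.append(f"Evidence:\n{cs['evidence']}\nFinding:\n{cs['finding']}\n")
    parts.append(f"Evidence:\n{new_evidence}\nFinding:")
    return "\n".join(parts)

# Hypothetical usage, with `llm` being your completion client of choice:
# proposed_finding = llm(build_few_shot_prompt(case_studies, new_evidence))
```

The same labeled dataset could of course also be used as a fine-tuning set rather than as few-shot context.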
It would also be interesting to go the other way: take a specific randomly selected neuron, look at its activation patterns across the entire corpus, and figure out whether it's a monosemantic neuron, and if so for what; or else look at its activation correlations with other neurons in the same layer and determine which superpositions, for which k-values, it forms part of and what they each represent. Using an LLM, or semantic search, to look at a large set of high-activation contexts and try to come up with plausible descriptions for them might be quite helpful here.
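A rough sketch of that workflow, assuming you already have a matrix of MLP activations (tokens × neurons) for one layer plus the token contexts they came from; the thresholds and helper names are assumptions, not anything from the paper:

```python
import numpy as np

def inspect_neuron(acts, contexts, neuron_idx, top_k=20, corr_threshold=0.5):
    """Gather evidence for whether a neuron looks monosemantic or shared."""
    col = acts[:, neuron_idx]

    # 1. Top-activating contexts: if they share one obvious theme, the neuron
    #    may be (roughly) monosemantic for that feature. These are the contexts
    #    you would hand to an LLM or semantic search for a candidate description.
    top_idx = np.argsort(col)[-top_k:][::-1]
    top_contexts = [contexts[i] for i in top_idx]

    # 2. Correlations with the other neurons in the same layer: strongly
    #    correlated partners suggest the feature is spread across several
    #    neurons, i.e. a candidate superposition to probe further.
    centered = acts - acts.mean(axis=0)
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(col - col.mean()) + 1e-8
    corrs = centered.T @ (col - col.mean()) / denom
    partners = [j for j in np.argsort(-np.abs(corrs))
                if j != neuron_idx and abs(corrs[j]) > corr_threshold]

    return top_contexts, partners
```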
Regarding my second suggestion above (using an LLM to describe a neuron from its high-activation contexts): it turns out this is not only feasible, but OpenAI have experimented with doing it, and in a blog post out today have open-sourced the tooling for it:
https://openai.com/research/language-models-can-explain-neurons-in-language-models
OpenAI were only looking at explaining single neurons, so combining their approach with the original paper's sparse probing technique for superpositions seems like the obvious next step.
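A minimal sketch of that combination, using an L1-regularized logistic probe as a stand-in for the paper's k-sparse probe (the selection method, regularization strength, and function names here are my assumptions, not the paper's exact procedure): first pick a small neuron set that jointly predicts a labeled feature, then collect the contexts that most activate the set and hand those to the explainer model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_probe_neurons(acts, labels, k=5):
    """Select roughly k neurons whose activations jointly predict a binary feature."""
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(acts, labels)                     # acts: (n_tokens, n_neurons)
    weights = np.abs(probe.coef_[0])
    return np.argsort(-weights)[:k]             # indices of the top-k neurons

def top_contexts_for_group(acts, contexts, neuron_idx, top_k=20):
    """Contexts that most activate the selected neuron group (summed activation)."""
    score = acts[:, neuron_idx].sum(axis=1)
    return [contexts[i] for i in np.argsort(-score)[:top_k]]
```

The returned contexts would then play the role that single-neuron top activations play in OpenAI's pipeline, with the explainer asked to describe what the neuron group as a whole responds to.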
Abstract
See the Twitter summary here.
Contributions
We will have a follow-up post in the coming weeks with what we see as the key alignment takeaways and open questions following this work.