I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I'm a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Cool paper. I think the semantic similarity result is particularly interesting.
As I understand it, you've got a circuit that wants to calculate something like Sim(A, B), where A and B might each have many "senses" (i.e. features), but Sim might not be a linear function of the per-sense similarities across all of those senses/features.
So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related, but probably none that really matter for copy suppression. For this reason I wouldn't expect the two tokens' embeddings to have a cosine similarity that is predictive of the copy suppression score. This would only happen for really "mono-semantic" tokens that have only one sense (maybe you could test that).
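In case it's useful, here's a minimal sketch of that test. All the names are mine, not the paper's: `W_E` is the token embedding matrix, `copy_suppression_score(a, b)` is a stand-in for however you compute the head's suppression metric for a token pair, and `mono_semantic_ids` would be a hand-curated list of token ids judged to have a single dominant sense.

```python
# Sketch: does embedding cosine similarity predict copy suppression,
# once we restrict to (roughly) mono-semantic tokens?
import torch

def cosine_sim(W_E: torch.Tensor, a: int, b: int) -> float:
    """Cosine similarity between two rows of the embedding matrix."""
    return torch.nn.functional.cosine_similarity(W_E[a], W_E[b], dim=0).item()

def sim_vs_suppression(W_E, token_pairs, copy_suppression_score):
    """Collect (cosine_sim, suppression_score) pairs so you can check
    predictiveness, e.g. with a rank correlation, separately for
    mono-semantic pairs vs. arbitrary pairs."""
    return [(cosine_sim(W_E, a, b), copy_suppression_score(a, b))
            for a, b in token_pairs]
```

The interesting comparison would be whether the correlation is much stronger when both tokens come from `mono_semantic_ids`.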
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively), e.g. very common words or punctuation (the/and/etc.).
I'd be interested if you used something like SAEs to decompose the tokens into the underlying features present at different intensities in each of these tokens (or in the activations prior to the key/query projections). Follow-up experiments could attempt to determine whether copy suppression is better understood once the semantic subspaces are known. Some things that might be cool here (a rough sketch follows the bullets below):
- Show that some features are mapped to the null space of the keys/queries in copy suppression heads, indicating semantic senses/features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination does), or suppression is informed by a more complicated function of sets of features.
- Similarly, show that the OV circuit is suppressing the same feature/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting, as it would correspond to something like "I calculate A and B as similar by their similarity along the *California axis*, but I suppress predictions of any token that has the feature for *anywhere on the West Coast*".
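Here's the rough sketch I had in mind for both bullets. This is my own construction, not anything from the paper: `feature_dirs` is a (n_features, d_model) matrix of SAE decoder directions for the residual stream feeding the head, and `W_Q`, `W_K` are (d_model, d_head) while `W_V` is (d_model, d_head) and `W_O` is (d_head, d_model), TransformerLens-style.

```python
import torch

def qk_sensitivity(feature_dirs: torch.Tensor, W_Q: torch.Tensor, W_K: torch.Tensor) -> torch.Tensor:
    """How much each feature direction can move the attention score on the
    query side. Near-zero norm after projecting through W_Q @ W_K.T means the
    feature is (approximately) in the null space of the head's QK circuit,
    i.e. ignored when deciding what to suppress. Use W_QK.T for the key side."""
    W_QK = W_Q @ W_K.T                      # (d_model, d_model) bilinear form
    return (feature_dirs @ W_QK).norm(dim=-1)

def ov_suppression(feature_dirs: torch.Tensor, W_V: torch.Tensor, W_O: torch.Tensor) -> torch.Tensor:
    """Crude proxy for how strongly the OV circuit writes against each
    feature: project each direction through W_V @ W_O and measure its overlap
    with the original direction (negative values = suppression along it)."""
    W_OV = W_V @ W_O                        # (d_model, d_model)
    out = feature_dirs @ W_OV
    return (out * feature_dirs).sum(dim=-1)
```

Comparing the two per feature is one way to get at the asymmetry in the second bullet: features the QK circuit attends to but the OV circuit doesn't suppress (or vice versa).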
I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.
It's undesirable because it makes SAEs more confusing / harder to use, but it's also a concrete obstacle to things like sparse feature circuits. For example, it causes automatically-constructed circuits composed of features to have issues. Instead of having a "starts with E" feature in a spelling circuit that fires on all "E" words, you have a "starts with E" feature that fires on most words (part of the circuit) plus all the "E"-absorbing features, like one that fires on the token "Europe" (a large number of features, which adds lots of nodes to the circuit). So instead of being able to create a simple DAG, you get a complex one.
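To make the node-count point concrete, here's a toy sketch (not the sparse feature circuits pipeline, just an illustration on made-up data): `acts` is a hypothetical (n_tokens, n_latents) boolean matrix of SAE latent firings and `starts_with_e` a boolean label per token.

```python
import numpy as np

def latents_needed_for_coverage(acts: np.ndarray, starts_with_e: np.ndarray) -> int:
    """Greedily count how many latents you need before every 'starts with E'
    token has at least one firing latent. With no absorption this is ~1 (the
    clean 'starts with E' latent); with absorption it's 1 + one latent per
    absorbing token family ('Europe', 'Eagle', ...), i.e. that many extra
    nodes in the spelling circuit's DAG."""
    remaining = starts_with_e.copy()
    count = 0
    while remaining.any():
        coverage = (acts & remaining[:, None]).sum(axis=0)  # uncovered E-words per latent
        best = int(coverage.argmax())
        if coverage[best] == 0:
            break  # some E-words aren't covered by any latent at all
        remaining &= ~acts[:, best]
        count += 1
    return count
```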
I don't remember signed GSEA being standard when I used it so no particular reason. Could try it.
I suspect this is a good experiment. It may help!
Interesting - I agree I don't have strong evidence here (and certainly we haven't shown it). It sounds like it might be worth my attempting to prove this point with further investigation. This is an update for me - I didn't think this interpretation would be contentious!
> than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert "dishonesty"?
> That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree but think this is very subtle so want to go into more depth here:
I think this is a valuable read for people who work in interp but feel like I want to add a few ideas:
I'd be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (e.g. there is a noisy signal, but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying the intermediate representations in such models could be interesting.
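Here's a toy version of what I mean, my own construction rather than anything from the post: a one-parameter "detector" trained on a noisy signal where missed positives are penalised more heavily than false alarms, so the learned detector ends up biased toward recall rather than calibration.

```python
import torch

torch.manual_seed(0)
n = 5000
label = (torch.rand(n) < 0.3).float()        # ground-truth event
signal = label + 0.8 * torch.randn(n)        # noisy observation of it

detector = torch.nn.Linear(1, 1)
opt = torch.optim.Adam(detector.parameters(), lr=0.05)
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))  # reward recall over precision

for _ in range(500):
    opt.zero_grad()
    logits = detector(signal.unsqueeze(1)).squeeze(1)
    loss_fn(logits, label).backward()
    opt.step()

with torch.no_grad():
    preds = detector(signal.unsqueeze(1)).squeeze(1) > 0
    tp = (preds & label.bool()).float().sum()
    recall = tp / label.sum()
    precision = tp / preds.float().sum().clamp(min=1)
    print(f"recall={recall:.2f} precision={precision:.2f}")  # recall >> precision
```

The interp question would then be what the intermediate representation of "the event happened" looks like in a bigger model trained under this kind of asymmetric objective.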
I think I like a lot of the thinking in the post (e.g. trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.
Thanks for posting this! I've had a lot of conversations with people lately about OthelloGPT and I think it's been useful for creating consensus about what we expect sparse autoencoders to recover in language models.
Maybe I missed it but:
I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/theirs probe directions, though my personal opinion was that, besides "this square is a legal move", it isn't clear that we should expect features to act as classifiers over the board state in the same way that probes do.
This reflects several intuitions:
Some other feedback:
Oh, and maybe you saw this already, but an academic group put out this related work: https://arxiv.org/abs/2402.12201. I don't think they quantify the proportion of probe directions they recover, but they do indicate recovery of all the types of features that have previously been probed for. Likely worth a read if you haven't seen it.