Interpretability Researcher at Apollo Research
Just checking -- you are aware that the reasoning traces shown in the UI are a summarized version of the full reasoning trace (which is not user accessible)?
See "Hiding the Chains of Thought" here.
See also their system card focusing on safety evals: https://openai.com/index/openai-o1-system-card/
Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.
To investigate this, one could take each feature and calculate the variance in its activations that can be explained by position. If this variance-explained is either ~0% or ~100% for every feature, I'd be satisfied that positional and semantic information are being well separated into separate features.
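As a rough illustration, here's a minimal sketch of that check, assuming you've already collected SAE feature activations over a batch of equal-length sequences (the tensor name and shapes are illustrative, not from any particular codebase):

```python
import torch

def variance_explained_by_position(feature_acts: torch.Tensor) -> torch.Tensor:
    """Fraction of each feature's activation variance explained by position.

    feature_acts: [n_seqs, n_positions, n_features] SAE feature activations
    collected over a batch of equal-length sequences.
    Returns a [n_features] tensor of R^2 values (one-way ANOVA over position).
    """
    grand_mean = feature_acts.mean(dim=(0, 1))   # [n_features]
    pos_mean = feature_acts.mean(dim=0)          # [n_positions, n_features]
    # Between-position variance: how far the per-position means sit from the grand mean.
    var_between = ((pos_mean - grand_mean) ** 2).mean(dim=0)
    # Total variance of each feature over all tokens.
    var_total = ((feature_acts - grand_mean) ** 2).mean(dim=(0, 1))
    return var_between / (var_total + 1e-8)

# Values near 0 suggest a purely semantic feature, values near 1 a purely
# positional one; intermediate values are the entangled case I'm worried about.
```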
In general, I think it makes sense to special-case positional information. Even if positional information is well separated, I expect converting it into SAE features probably hurts interpretability. This is easy to do in shortformers[1] or rotary models (as positional information isn't added to the residual stream). One would have to work a bit harder for GPT-2, but it still seems worthwhile imo.
Position embeddings are trained but only added to the key and query calculation, see Section 5 of this paper.
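For the GPT-2 case, one crude way to special-case position would be to estimate the positional component of the residual stream (the per-position mean over many inputs, which at early layers is dominated by the learned positional embedding), subtract it, and train the SAE on the remainder. A hedged sketch, assuming a TransformerLens-style `HookedTransformer`; the hook point is an arbitrary illustrative choice:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HOOK_NAME = "blocks.6.hook_resid_pre"   # illustrative layer choice

@torch.no_grad()
def positional_component(tokens: torch.Tensor) -> torch.Tensor:
    """Estimate the positional component as the per-position mean residual."""
    _, cache = model.run_with_cache(tokens, names_filter=HOOK_NAME)
    resid = cache[HOOK_NAME]            # [batch, n_positions, d_model]
    return resid.mean(dim=0)            # [n_positions, d_model]

@torch.no_grad()
def depositionalised_residuals(tokens: torch.Tensor, pos_component: torch.Tensor) -> torch.Tensor:
    """Residual stream with the estimated positional component removed;
    this is what the SAE would be trained on."""
    _, cache = model.run_with_cache(tokens, names_filter=HOOK_NAME)
    resid = cache[HOOK_NAME]
    return resid - pos_component[: resid.shape[1]]
```

Position would then be tracked as a known variable alongside the dictionary rather than forced into SAE features; whether the per-position mean is a good enough estimate of the positional component is itself something to check.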
Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.
Some other very minor comments:
To be clear, the authors don't claim this and I'm not intending this as a criticism of them.
My summary of the paper:
One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?
Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]
They propose to find these features [S1.1] by performing dictionary learning of some form, and then some combination of:
Generally, dictionary learning only learns features that actually occur on some inputs, so you need data from outside the normal deployment distribution on which these features do occur. Some ideas (many alluded to in the document):
The situation is easier for features of medium concern, which are only problematic in certain contexts. Ideally you can learn the features in benign contexts and they’ll transfer to catastrophic contexts. For example, it’s okay if your model lies when roleplaying as a shady car salesman, when playing mafia, or in other low-stakes scenarios. If you can use these to learn a “deception” feature it might activate when the model assures you that some backdoored code is safe. You’d still want to validate this transfer empirically!
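To make the "validate this transfer empirically" step concrete, here's a hedged sketch of the kind of check I have in mind, where `get_feature_activation` is a hypothetical helper that runs the model plus SAE on a prompt and returns the deception feature's max activation (none of these names come from the safety case document):

```python
from typing import Callable, Sequence

def activation_rate(
    prompts: Sequence[str],
    get_feature_activation: Callable[[str], float],
    threshold: float = 0.0,
) -> float:
    """Fraction of prompts on which the feature fires above `threshold`."""
    fired = [get_feature_activation(p) > threshold for p in prompts]
    return sum(fired) / len(fired)

# Transfer check: we'd want a high rate on the low-stakes deception prompts
# (sanity check that the feature was learned at all), a low rate on honest
# baselines, and -- crucially -- a high rate on held-out catastrophic-proxy
# transcripts, e.g. from a model organism that backdoors code and denies it.
# rate_roleplay     = activation_rate(roleplay_deception_prompts, get_feature_activation)
# rate_honest       = activation_rate(honest_baseline_prompts, get_feature_activation)
# rate_catastrophic = activation_rate(model_organism_transcripts, get_feature_activation)
```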
I think it would be valuable to think about how this could work in specific scenarios. That is, specify different combinations of
Then try to answer the question of how you'd learn the feature such that it only activates in deployment when the model is behaving catastrophically. Do this for examples like
All of these strategies rely on the hope of feature transfer to the on-policy deployment distribution. It seems possible to get enough evidence to think it's likely these features transfer but difficult to have enough evidence to be highly confident. As they mention, having multiple realistic model organisms seems necessary for validation. Where do you get model organisms? Maybe