Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.
Can we extend this to nonlinear features? A simple example of a nonlinear feature is circular representations in modular arithmetic. Here, it's clear that a simple "steering vector" will not work. Nonetheless, as the authors show, it's possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result.
Problem: The construction of a steering intervention in the modular addition paper relies heavily on the a-priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn't need to fully elucidate this geometry in order for steering to be effective.
Therefore, we want a procedure which learns a nonlinear steering intervention given only the model's activations and labels (e.g. the correct next-token).
Such a procedure might look something like this:
If this works, it might be applicable to other examples of nonlinear feature geometries as well.
Thanks to David Chanin for useful discussions.
You might be interested in works like Kernelized Concept Erasure, Representation Surgery: Theory and Practice of Affine Steering, Identifying Linear Relational Concepts in Large Language Models.
This is really interesting, thanks! As I understand, "affine steering" applies an affine map to the activations, and this is expressive enough to perform a "rotation" on the circle. David Chanin has told me before that LRC doesn't really work for steering vectors. Didn't grok kernelized concept erasure yet but will have another read.
Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition
This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.
By 1 October 2024, I am committing to have produced:
Habits I am committing to that will support this:
SAE feature directions are likely "special" rather than "random".
Re: the last point above, this points to singular learning theory being an effective tool for analysis.
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation: Neel's short summary, i.e. editing in the Rome fact will also make slightly related questions e.g. "The Louvre is cool. Obama was born in" ... be completed with " Rome" too.
If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
Here are two notions of "equivalence:
A priori, I would expect that we get rough functional equivalence, but not feature equivalence. I think this experiment would help elucidate the underlying invariant geometrical structure that SAE features are suspected to be in.
Changelog:
Found this graph on the old sparse_coding channel on the eleuther discord:
Logan Riggs: For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably layer 5 is sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand.
So at least tentatively that looks like "most features in a small SAE correspond one-to-one with features in a larger SAE trained on the activations of the same model on the same data".
Oh that's really interesting! Can you clarify what "MCS" means? And can you elaborate a bit on how I'm supposed to interpret these graphs?
Yeah, stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure for how close two vectors are to pointing in the same direction. It's the cosine of the angle between the two vectors, so +1.0 means the vectors are pointing in exactly the same direction, 0.0 means the vectors are orthogonal, -1.0 means the vectors are pointing in exactly opposite directions.
To generate this graph, I think he took each of the learned features in the smaller dictionary, and then calculated to cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then the maximal cosine similarity was the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment()
thing (to ensure that each feature in the large dictionary could only be used once in order avoid having multiple features in the small dictionary have their MCS come from the same feature on the large dictionary) though IIRC it didn't actually matter.
Also I think the small and large dictionaries were trained using different methods as each other for layer 2, and this was on pythia-70m-deduped so layer 5 was the final layer immediately before unembedding (so naively I'd expect most of the "features" to just be "the output token will be the
" or "the output token will be when
" etc).
Edit: In terms of "how to interpret these graphs", they're histograms with the horizontal axis being bins of cosine similarity, and the vertical axis being how many small-dictionary features had a the cosine similarity with a large-dictionary feature within that bucket. So you can see at layer 3 it looks like somewhere around half of the small dictionary features had a cosine similarity of 0.96-1.0 with one of the large dictionary features, and almost all of them had a cosine similarity of at least 0.8 with the best large-dictionary feature.
Which I read as "large dictionaries find basically the same features as small ones, plus some new ones".
Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook so I think smaller_dict
was of size 2048 and larger_dict
was size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x residual width with their "Towards Monosemanticity" paper later that year, and the behavior might have changed at that scale.
If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
For SAEs of different sizes, for most layers, the smaller SAE does contain very high similarity with some of the larger SAE features, but it's not always true. I'm working on an upcoming post on this.
Interesting, we find that all features in a smaller SAE have a feature in a larger SAE with cosine similarity > 0.7, but not all features in a larger SAE have a close relative in a smaller SAE (but about ~65% do have a close equavalent at 2x scale up).
A collection of empirical evidence
Do language models exhibit self-repair?
One notion of self-repair is redundancy; having "backup" components which do the same thing, should the original component fail for some reason. Some examples:
Another notion of self-repair is "regulation"; suppressing an overstimulated component.
A third notion of self-repair is "error correction".
Self-repair is annoying from the interpretability perspective.
A related thought: Grokked models probably do not exhibit self-repair.
Taken together, I guess this means that self-repair is a coping mechanism for the "noisiness" / "messiness" of real data like language.
It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair).
It's a fascinating phenomenon. If I had to bet I would say it isn't a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
That's a really interesting blogpost, thanks for sharing! I skimmed it but I didn't really grasp the point you were making here. Can you explain what you think specifically causes self-repair?
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights, which "compute the same thing" but one of them has self-repair for a given behaviour and one doesn't, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of both, the one with self-repair gives you a higher number, i.e. its preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it's just intuition. It should be possible to empirically check this somehow but it hasn't been done.
Basically the argument is self-repair => robustness of behaviour to small variations in the weights => low local learning coefficient => low free energy => preferred
I think by "specifically" you might be asking for a mechanism which causes the self-repair to develop? I have no idea.
This is an empirical observation about recent works on feature geometry, that (regular) polytopes are a recurring theme in feature geometry.
Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.
Regular polygons in models. Recent work studying natural language modular arithmetic has found that language models represent things in a circular fashion. I will contend that "circle" is a bit imprecise; these are actually regular polygons, which are the 2-dimensional versions of polytopes.
A reason why polytopes could be a natural unit of feature geometry is that they characterize linear regions of the activation space in ReLU networks. However, I will note that it's not clear that this motivation for polytopes coincides very well with the empirical observations above.
I’m concerned that progress in interpretability research is ephemeral, being driven primarily by proxy metrics that may be disconnected from the end goal (understanding by humans). (Example: optimising for the L0 metric in SAE interpretability research may lead us to models that have more split features, even when this is unintuitive by human reckoning.)
It seems important for the field to agree on some common benchmark / proxy metric that is proven to be indicative of downstream human-rated interpretability, but I don’t know of anyone doing this. Similar to the role of BLEU in facilitating progress in NLP, I imagine having a standard metric would enable much more rapid and concrete progress in interpretability.
This is a paper reproduction in service of achieving my seasonal goals
Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I've reproduced this for GPT-2 small in this Colab.
We've confirmed that days of the week do appear to be represented in a circular fashion in the model. Furthermore, looking at feature dashboards agrees with the discovery; this suggests that simply looking up features that detect tokens in the same conceptual 'category' could be another way of finding clusters of features with interesting geometry.
Next steps:
1. Here, we've selected 9 SAE features, gotten the reconstruction, and then compressed this down via PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?
2. The SAE reconstruction using 9 features is probably a very small component of the model's overall representation of this token. What's in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a 'full' representation of the original model.
Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction.
Steganography; the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to "speak in code".
In order to better study steganography, it would be useful to construct model organisms of steganography, which we don't have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.
Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.
This has been demonstrated to be possible in Krasheninnikov et al, 2024 as well as Berglund et al, 2024.
This seems like a pretty effective path to creating model organisms of steganography. E.g.
I probably won't work on this myself, but I'm pretty interested in someone doing this and reporting their results
Rough summary of a conversation I had with Aengus Lynch
As a mech interp researcher, one thing I've been trying to do recently is to figure out my big cruxes for mech interp, and then filter projects by whether they are related to these cruxes.
Aengus made the counterpoint that this can be dangerous, because even the best researchers' mental model of what will be impactful in the future is likely wrong, and errors will compound through time. Also, time spent refining a mental model is time not spent doing real work. Instead, he advocated for working on projects that seem likely to yield near-term value
I still think I got a lot of value out of thinking about my cruxes, but I agree with the sentiment that this shouldn't consume excessive amounts of my time
A rough summary of an insightful discussion with Adam Gleave, FAR AI
We want our models to be adversarially robust.
One technique which FAR AI has investigated extensively (in Go models) is adversarial training.
One thing I pitched to Adam was the notion of "adversarial robustness through grokking".
Adam seemed thoughtful, but had some key concerns.
His overall take was that it's mainly an "empirical question" whether grokking leads to adversarial robustness. He hadn't heard this idea before, but thought experiments / proofs of concept would be useful.
A rough summary of insightful discussions with Jake Mendel and Victor Veitch
Recent work on hierarchical feature geometry has made two specific predictions:
Example of hierarchical decomposition: A dalmation is a dog, which is a mammal, which is an animal. Writing this hierarchically, Dalmation < Dog < Mammal < Animal. In this context, the two propositions imply that:
According to Victor Veitch, the load-bearing assumption here is that different levels of the hierarchy are disentangled, and hence models want to represent them orthogonally. I.e. $x_{animal}$ is perpendicular to $x_{mammal | animal}$. I don't have a super rigorous explanation for why, but it's likely because this facilitates representing / sensing each thing independently.
Jake Mendel made the interesting point that, as long as the number of features is less than the number of dimensions, an orthogonal set of vectors will satisfy P1 and P2 for any hierarchy.
Example of P2 being satisfied. Let's say we have vectors $x_{animal} = (0,1)$ and $x_{plant} = (1,0)$, which are orthogonal. Then we could write $x_{living_thing} = (1/sqrt(2), 1/ sqrt(2))$. Then $x_{animal | living_thing}, x_{plant | living_thing}$ would form a 1-dimensional simplex.
Example of P1 being satisfied. Let's say we have four things A, B, C, D arranged in a binary tree such that AB, CD are pairs. Then we could write $x_A = x_{AB} + x_{A | AB}$, satisfying both P1 and P2. However, if we had an alternate hierarchy where AC and BD were pairs, we could still write $x_A = x_{AC} + x_{A | AC}$. Therefore hierarchy is in some sense an "illusion", as any hierarchy satisfies the propositions.
Taking these two points together, the interesting scenario is when we have more features than dimensions, i.e. the setting of superposition. Then we have the two conflicting incentives:
This points to super interesting questions:
Experiments on toy models will probably be very informative here.
Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here.
Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.
Definition 1: OV-incoherence. An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from.
Example 2: Skip-trigram circuits. A skip trigram consists of a sequence [A]...[B] -> [C], where A, B, C are distinct tokens.
Claim 3: A single head cannot implement multiple OV-incoherent circuits. Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output. only the query token. Since it does not see the key token, it must compute a fixed function of the query.
Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads.
An attention transcoder (ATC) is described as follows:
Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks.
Remark 6: Intervening on ATC heads. Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model's computational graph and intervening directly on individual head outputs.
Remark 7: Attributing ATC heads to ground-truth heads. In standard attention-out SAEs, it's possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working.
Does AHS actually occur in language models? I think we do not have crisp examples at the moment.
The first and most obvious experiment is to try training an ATC and see if it works.
Conditional on that succeeding, the next step would be to attempt to interpret individual heads in an ATC and determine whether they are interpretable.
[Draft][Note] On Singular Learning Theory
Relevant links
It's an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.
Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.
The proposal here is: look at the SAE activations for the tetrahedron, identify a relevant cluster, and then evaluate whether this matches the ground-truth.
Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?
Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.
The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.
My conclusion from this is a big downwards update on the likelihood of the "non-neuron aligned basis" in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.
You'll enjoy reading What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (link to the paper)
Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap.
Previous work on grokking finds that models can grok modular addition and tree search. However, these are not tasks formulated in natural language. Instead, the tokens correspond directly to true underlying abstract entities, such as numerical values or nodes in a graph. I question whether this representational simplicity is a key ingredient of grokking reasoning.
I have a prior that expressing concepts in natural language (as opposed to directly representing concepts as tokens) introduces an additional layer of complexity which makes grokking much more difficult.
The proposal here is to repeat the experiments with tasks that test equivalent reasoning skills, but which are formulated in natural language.
I'd expect that we could observe grokking, but that it might take a lot longer (and require larger models) when compared to the "direct concept tokenization". Conditioned on this being true, it would be interesting to observe whether we recover the same kinds of circuits as demonstrated in prior work.
Universal features. Work such as the Platonic Representation Hypothesis suggest that sufficiently capable models converge to the same representations of the data. To me, this indicates that the underlying "entities" which make up reality are universally agreed upon by models.
Non-universal circuits. There are many different algorithms which could correctly solve the same problem. Prior work such as the clock and the pizza indicate that, even for very simple algorithms, models can learn very different algorithms depending on the "attention rate".
Circuit universality is a crux. If circuits are mostly model-specific rather than being universal, it makes the near-term impact of MI a lot lower, since finding a circuit in one model tells us very little about what a slightly different model is doing.
Concrete experiment: Evaluating the universality of IOI. Gurnee et al train several GPT-2 small checkpoints from scratch. We know from prior work that GPT-2 small has an IOI circuit. What, if any, components of this turn to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always at the same layer? Etc. I think this experiment would inform us a lot about circuit universality.