This dialogue mostly makes me want to rant about how all y'all are doing mech interp wrong. So, rant time. This-is-a-rant-so-not-necessarily-reflectively-endorsed, etc.
Starting point: Science In A High-Dimensional World. Quoting from that post:
In a high-dimensional world like ours, there are billions of variables which could influence an outcome. The great challenge is to figure out which variables are directly relevant - i.e. which variables mediate the influence of everything else. In practice, this looks like finding mediators and hunting down sources of randomness. Once we have a set of control variables which is sufficient to (approximately) determine the outcome, we can (approximately) rule out the relevance of any other variables in the rest of the universe, given the control variables.
A remarkable empirical finding across many scientific fields, at many different scales and levels of abstraction, is that a small set of control variables usually suffices. Most of the universe is not directly relevant to most outcomes most of the time.
Ultimately, this is a picture of “gears-level science”: look for mediation, hunt down sources of randomness, rule out the influence of all the other variables in the universe.
This applies to interpretability just like any other scientific field. The real gold-standard thing to look for is some relatively-small set of variables which determine some other variables, basically-deterministically. Or, slightly weaker: a relatively-small Markov blanket which screens off some chunk of the system from everything else.
In order for this to be useful, the determinism/screening does need pretty high precision - e.g. Ryan's 99% number sounds like a reasonable day-to-day heuristic, many nines might be needed if there's a lot of bits involved, etc.
On the flip side, this does not necessarily need to look like a complete mechanistic explanation. Ideally, findings of screening are the building blocks from which a complete mechanistic model is built. The key point is that findings of screening provide an intermediate unit of progress, in between "no clue what's going on" and "full mechanistic interpretation". Those intermediate units of progress can be directly valuable in their own right, because they allow us to rule things out: (one way to frame) the whole point of screening is that lots of interactions are ruled out. And they directly steer the search for mechanistic explanations, by ruling out broad classes of models.
That is the sort of approach to mech interp which would be able to provide valuable incremental progress on large models, not just toy models, because it doesn't require understanding everything about a piece before something useful is produced.
(Side note: yet another framing of all this would be in terms of modules/modularity.)
This seems like exactly what mech interp is doing? Circuit finding is all about finding sparse subgraphs. It continues to work with large models, when trying to explain a piece of the behavior of the large model. SAE stands for sparse autoencoder: the whole point is to find the basis in which you get sparsity. I feel like a lot of mech interp has been almost entirely organized around the principle of modularity / sparsity, and the main challenge is that it's hard (you don't get to 99% of loss recovered, even on pieces of behavior, while still being meaningfully sparse).
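For concreteness, here's a minimal sketch of the kind of sparse autoencoder being discussed (the dimensions and L1 coefficient below are illustrative placeholders, not values anyone actually uses):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode d_model activations into an overcomplete
    dictionary of d_dict latents, with an L1 penalty encouraging sparsity."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))   # sparse, non-negative codes
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(x, recon, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent codes.
    return ((recon - x) ** 2).mean() + l1_coeff * latents.abs().mean()

# Illustrative usage: activations of width 512, 16x overcomplete dictionary.
sae = SparseAutoencoder(d_model=512, d_dict=512 * 16)
x = torch.randn(64, 512)                 # stand-in for a batch of activations
recon, latents = sae(x)
loss = sae_loss(x, recon, latents)
loss.backward()
```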
SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that only a few elements of the dictionary are active (non-zero) at any given time; they do not deliver sparsity in the sense of a low-dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.
My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.
I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"
The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models; otherwise, optimising them for inference through e.g. low-rank approximation of weight matrices would never work. You are throwing away at least one floating point number's worth of description bits there.
I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"
A low-dimensional summary of a variable vector v of size n is a fixed set of k ≪ n random variables that suffice to summarise the state of v. To summarise the state of v using the activations in an SAE dictionary, I have to describe the state of more than n variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes less than n variables, but that just corresponds to undoing the autoencoding and then performing some other compression.
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
I'd be curious to hear more about this - IMO we're talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
I don't have the time and energy to do this properly right now, but here's a few thought experiments to maybe help communicate part of what I mean:
Say you have a transformer model that draws animals. As in, you type “draw me a giraffe”, and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.
The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw.
If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals (far more kinds than fifty, or than the network width), I’d guess the “sparse features” the SAE finds might be the individual animal types. “Giraffe”, “elephant”, etc.
Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up.
And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.
Just because we can find a transformation that turns an NN's activations into numbers that correlate with what a human observer would regard as separate features of the data does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense.
The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.
If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn't matter for the NNs task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.
If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
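A tiny toy sketch of that picture (the cluster names and numbers are made up; the point is just that the same data admits both a two-dimensional compositional description and a sparse cluster-by-cluster one):

```python
import numpy as np

rng = np.random.default_rng(0)

# Compositional summary variables: every point is just a (tree-ness, size) pair.
centroids = {            # hypothetical cluster centres in (tree, size) space
    "bush":  np.array([0.6, 0.2]),
    "house": np.array([0.1, 0.9]),
    "fir":   np.array([0.9, 0.8]),
}
labels = rng.choice(list(centroids), size=1000)
data = np.stack([centroids[l] for l in labels]) + 0.03 * rng.standard_normal((1000, 2))

# Sparse-generator view of the same data: one mostly-one-hot code per cluster.
dictionary = np.stack(list(centroids.values()))          # 3 x 2 "dictionary"
codes = np.stack([np.eye(3)[list(centroids).index(l)] for l in labels])
recon = codes @ dictionary

print("max reconstruction error of the sparse cluster code:",
      np.abs(recon - data).max())   # small: the sparse code fits the data up to noise
# Both descriptions fit the data; the sparse one has more "features", but the
# compositional (tree, size) pair is the lower-dimensional summary.
```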
Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn't thrown away the information needed to compute that there was a bush in the image, but we don't know that it is thinking in “bush”. It probably isn't, else “bush” would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn't have found it.
This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we're trying to explain an LLM (or any generative model of messy real-world data) than they do if we're trying to explain the animal-drawing NN.
In the animal-drawing example:
With something like an LLM, we expect the situation to be more like:
The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
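For concreteness, a sketch of this kind of setup (the module path assumes the HuggingFace GPT-NeoX layout; the corpus, layer index, and re-initialization scheme are placeholders, not the ones actually used):

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Recipe: collect MLP activations from Pythia-70m and from a copy with
# randomized weights, then train an SAE on each set of activations and
# compare how interpretable the resulting features look.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

random_model = copy.deepcopy(model)
with torch.no_grad():
    for p in random_model.parameters():       # re-initialize every parameter
        p.normal_(mean=0.0, std=0.02)

def mlp_activations(m, texts, layer=3):
    """Run `texts` through model `m` and grab the chosen layer's MLP output."""
    acts = []
    handle = m.gpt_neox.layers[layer].mlp.register_forward_hook(
        lambda module, inp, out: acts.append(out.detach())
    )
    with torch.no_grad():
        for t in texts:
            m(**tokenizer(t, return_tensors="pt"))
    handle.remove()
    return torch.cat([a.reshape(-1, a.shape[-1]) for a in acts])

texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in corpus
real_acts = mlp_activations(model, texts)
rand_acts = mlp_activations(random_model, texts)
# ...then train one SAE on real_acts and another on rand_acts (e.g. with the
# minimal SAE sketched earlier) and inspect which features look interpretable.
```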
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" token and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)
We also did a variant of this experiment where we randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).
(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single-token features on the part of the randomized transformer, and when you ignore these features (or correct in some other way I'm forgetting?), the SAE on an actual transformer indeed has higher correlation.)
Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
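For reference, a sketch of the metric as described (treating one dictionary as the ground truth; exact implementation details may differ):

```python
import torch

def mean_max_cosine_similarity(reference: torch.Tensor, learned: torch.Tensor) -> float:
    """For each element of the `reference` dictionary (n_ref x d), take its best
    cosine-similarity match among the `learned` dictionary's elements
    (n_learned x d), then average over reference elements."""
    ref = torch.nn.functional.normalize(reference, dim=-1)
    lrn = torch.nn.functional.normalize(learned, dim=-1)
    sims = ref @ lrn.T                   # n_ref x n_learned cosine similarities
    return sims.max(dim=-1).values.mean().item()

# Illustrative usage with random dictionaries (real ones come from trained SAEs):
d_model, n_feats = 512, 4096
dict_a = torch.randn(n_feats, d_model)
dict_b = torch.randn(n_feats, d_model)
print(mean_max_cosine_similarity(dict_a, dict_b))   # random dictionaries score low
```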
One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%. That's a pretty core part of what makes science work, in general. And yeah, that's hard (at least in the sense of being a lot of work; more arguable whether it's hard in a stronger sense than that).
One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.
When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).
It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we can improve; we're very far from the 99% standard. But the way we improve won't be by "drilling into the residual"; that has been tried and is insufficient. EDIT: Possibly by "drill into the residual" you mean "understand why the methods don't work and then improve them" -- if so I agree with that but also think this is what mech interp researchers want to do.
(Why am I still optimistic about interpretability? I'm not convinced that the 99% standard is required for downstream impact -- though I am pretty pessimistic about the "enumerative safety" story of impact, basically for the same reasons as Buck and Ryan afaict.)
It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods).
I have the opposite expectation there; I think it's just that current methods are pretty primitive.
I also think current methods are mostly focused on linear interp. But what if it just ain't linear.
It's not clear what experiments this mindset suggests. If I squint I basically get things that are already being tried, like intervention experiments to determine the connections between SAE features.
Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!
The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.
Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).
Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X" classifier. However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.
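A minimal sketch of the "only treat statements as true if all of the accurate features agree" idea (the data, feature counts, and threshold here are made up):

```python
import numpy as np

def conjunctive_truth_classifier(feature_acts, labels, threshold=0.95):
    """feature_acts: (n_statements, n_features) binary feature activations on a
    labeled dataset; labels: (n_statements,) 0/1 truth labels.
    Select every feature that is individually an accurate classifier on the
    labeled data, then (at deployment) call a statement true only if all of
    the selected features fire on it."""
    accuracies = (feature_acts == labels[:, None]).mean(axis=0)
    selected = np.where(accuracies >= threshold)[0]

    def classify(new_acts):
        return new_acts[:, selected].all(axis=1)

    return selected, classify

# Hypothetical usage: 1000 labeled statements, 50 candidate features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
feature_acts = np.repeat(labels[:, None], 50, axis=1)        # 10 label-tracking features
feature_acts[:, 10:] = rng.integers(0, 2, size=(1000, 40))   # 40 junk features
selected, classify = conjunctive_truth_classifier(feature_acts, labels)
print(selected)                                          # picks out features 0-9
print((classify(feature_acts) == labels.astype(bool)).mean())   # sanity check: 1.0
```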
This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.
Curious for thoughts about/critiques of this impact story.
Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).
I don't think this exact thing is directly mentioned by my list. Thanks for the addition.
Let me try to state something which captures most of that approach to make sure I understand:
Approach: Maybe we can find some decomposition of model internals[1] such that all or most components directly related to some particular aspect of cognition are overtly obvious and there are also a small number of such components. Then, maybe we can analyze, edit, or build a classifier using these components in cases where baseline training techniques (e.g. probing) are insufficient.
Then, it seems like there are two cases where this is useful:
This impact story seems overall somewhat reasonable to me. It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial. My main concerns are:
If you thought that current fundamental science in mech interp was close to doing this, I think I'd probably be excited about building test bed(s) where you think this sort of approach could be usefully applied and which aren't trivially solved by other methods. If you don't think the fundamentals of mech interp are close, it would be interesting to understand what you think will change to make this story viable in the future (better decompositions? something else?).
Either a "default" decomposition like neurons/attention heads or "non-default" decomposition like a sparse autoencoder. ↩︎
Let me try to state something which captures most of that approach to make sure I understand:
Everything you wrote describing the hope looks right to me.
It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial.
To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"
If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.
[...]
If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.
What's the specific way you imagine this failing? Some options:
Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
What's the specific way you imagine this failing? Some options:
My proposed list (which borrows from your list):
These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?
It's of course possible to improve on just a probe trained on the data we can construct, but you'll need non-trivial precision to do so.
The key question here is "why does selecting a feature work while just naively training a probe fails".
We have to be getting some additional bits from our selection of the feature.
In more detail, let's suppose we use the following process:
Then there are two issues:
IMO there are two kinda separate (and separable) things going on:
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)
"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:
I was actually imagining a hybrid between probes and features. The actual classifier doesn't need to be part of a complete decomposition, but for the connections we do maybe want the complete decomposition to fully analyze connections including the recursive case.
So:
I also think there's a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever, rather than needing to train a bunch).
As you mentioned, you probably can do this entirely with just learned probes, via a mechanism like the one you said (but this is less clean than decomposition).
You could also apply amnesic probing, which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions via LEACE so that certain classes can no longer be discriminated, as we discussed in the measurement tampering paper).
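For illustration, a very simplified sketch of this kind of linear erasure (a mean-difference projection used as a stand-in, not the actual LEACE estimator; the data and names are synthetic):

```python
import numpy as np

def erase_direction(X, y):
    """Simplified linear concept erasure: project out the class-mean-difference
    direction so a linear probe can no longer use it to separate the classes."""
    direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    return X - np.outer(X @ direction, direction)

# Hypothetical usage: activations X with an unwanted binary attribute y baked in.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.standard_normal((500, 64)) + 2.0 * y[:, None] * np.eye(64)[0]  # attribute lives along dim 0
X_erased = erase_direction(X, y)
print(X_erased[y == 1, 0].mean() - X_erased[y == 0, 0].mean())  # close to 0 after erasure
```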
(TBC, doesn't seem that useful to argue about "what is mech interp", I think the more central question is "how likely is it that all this prior work and ideas related to mech interp are useful". This is a strictly higher bar, but we should apply the same adjustment for work in all other areas etc.)
More generally, it seems good to be careful about thinking through questions like "does using X have a principled reason to be better than applying the 'default' approach (e.g. training a probe)?". It's good to do this regardless of whether we actually use the default approach, so we know where the juice is coming from.
In the case of mech interp style decompositions, I'm pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But, there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections etc).
Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?
Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.
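A toy version of this failure mode, as a sketch (synthetic features; with default regularization the probe typically behaves as described, though exact numbers vary with hyperparameters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_feats = 5000, 5

# Training distribution: feature 3 only ever fires together with feature 0.
X = (rng.random((n, n_feats)) < 0.3).astype(float)
X[:, 3] = X[:, 0] * (rng.random(n) < 0.5)
y = (X.sum(axis=1) > 0).astype(int)          # the model "lies" if any feature fires

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("learned weights:", probe.coef_.round(2))

# Out-of-distribution: feature 3 fires on its own. Ground truth: the model lies.
x_ood = np.zeros((1, n_feats)); x_ood[0, 3] = 1.0
print("probe's P(lying):", probe.predict_proba(x_ood)[0, 1].round(3))
# The probe tends to lean on feature 0 and give this case a low probability,
# even though feature 3 alone causes lying.
```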
Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
My guess is that the classification task for waterbirds is sufficiently easy that butchering a substantial part of the model is fine. It won't usually be viable to ablate everything that looks similar to an undesirable property. In some cases, this might be fine due to redundancy, but if there is heavy redundancy, I also expect that you've missed some stuff if you just look for components which look related to a given target.
Not super high confidence overall.
Edit: it also seems likely to me that there is a more principled and simpler approach like using LEACE which works just as well or better (but I'm unsure and I'm not familiar with that paper or the literature here).
To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"
I mean reducing doom by a large amount for very powerful models.
Curated. I’ve often felt that mech interp seems like a subfield of alignment work that has good traction and is making progress. And in my experience that’s a fairly common view. If that were true, that would be a pretty big deal. It might be that we could make a big dent with a pretty scalable field of research. So it seems pretty valuable to read thoughtful arguments to the contrary.
I gotta say I have some hesitation in curating this dialogue. The best and most informative topics are kind of diffused over the dialogue, and I feel like I never quite get enough depth or concreteness to really think through the claims. I think my main takeaways are:
(a) presumably a lot of safety-relevant stuff is in the diff between weaker and stronger models, and so you have to think about how you’d tell you’re explaining that diff, and
(b) a question about whether ‘induction heads exist’ or not, and what that means for whether or not mech interp has started making meaningful progress.
It’s not clear how mech interp could provide strong evidence against scheming (deceptive alignment), even if mech interp were much better.
I think an important question here is "how much of the model is in on the scheming?": what proportion of the circuits in it differ significantly from the most similar non-scheming model. If the answer was, say, >=1%, then once our mechanistic interpretability was good enough to reliably tell a scheming circuit from a non-scheming circuit (say, by identifying that the former had a suspicious conditional dependency on some network of "deceitful mode vs. powerseeking mode" flags that can switch it over to another behavior pattern), then even looking at several hundred randomly selected circuits should normally let us detect the schemingness.
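The sampling arithmetic behind "several hundred randomly selected circuits", assuming the >=1% figure and a reliable per-circuit classifier:

```python
# If at least 1% of circuits differ from the nearest non-scheming model, the
# chance that a random sample of n circuits contains no scheming circuit is:
for n in (100, 300, 500):
    print(n, (1 - 0.01) ** n)   # ~0.366, ~0.049, ~0.0066
```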
There are a couple of possibilities here:
a) The model was learning the human phenomenon of a scheming liar who skillfully pretends to be X when they are actually Y, without much foreshadowing/hints (usually the author gives the reader hints). This is a moderately complicated mechanism, but one that it clearly needs to develop in order to model that aspect of real human behavior, and it then started also using this mechanism extensively itself at other times. In this case we could intentionally activate this mechanism using text illustrating this human behavior, locate it, then map out its influence.
b) The model somehow spontaneously either developed a separate copy of this mechanism, separate from the one it uses for modeling this behavior in humans, or repurposed a spare copy. In this case there's no clear way to trigger this mechanism to help us locate it. [It's unclear how/why a second such mechanism could arise (or get repurposed) during SGD, since until the mechanism is very reliable it would be unhelpful and get penalized (Singular Learning Theory suggests Occam's razor applies to LLMs), and once the mechanism was very reliable, if it never used its "now show my true intent Y" mode during training, it seems like that half of the behavior pattern would inevitably bitrot/get catastrophically forgotten. So to be maintained, the mechanism must use that other mode sometimes, but figuring out how to trigger it is nontrivial.]
A distinctly labor-intensive approach to dealing with the repurposing variant of b) (it would need to be automated) would be to keep numerous checkpoints during the training run, look at each of them, identify copies of mechanism a) as they arise, and watch for one of them getting repurposed.
Setting up a model too small to be dangerous where the repurposing version of b) was incentivized and then demonstrating that it occurs seems like it would give us a lot of information.
If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf
Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.
1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter
2. Plausibly the safety relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning much larger loss increases in GPT (N+1) when attaching SAEs are actually competitive with GPT N (ht Tom for this one)
On (1), I agree: if you could explain 80% of GPT-4 performance on a task and metric where GPT-3.5 performs half as well as GPT-4, then that would suffice for showing something interesting not in GPT-3.5. For instance, if an explanation was able to human-interpretably explain 80% of GPT-4's accuracy on solving APPS programming problems, then that accuracy would be higher than GPT-3.5's.
However, I expect that performance on these sorts of tasks is pretty sensitive, such that getting 80% of performance is much harder than getting 80% of loss recovered on webtext. Most prior results look at explaining loss on webtext or a narrow distribution of webtext, not at trying to preserve downstream performance on some task.
There are some reasons why it could be easier to explain a high fraction of training compute in downstream task performance (e.g. it's a task that humans can do as well as models), but also some annoyances related to only having a smaller amount of data.
I'm skeptical that (2) will qualitatively matter much, but I can see the intuition.
"We reverse engineered every single circuit and can predict exactly what the model will do using our hand-crafted code" seems like it's setting the bar way too high for MI.
Instead, aim for stuff like the AI Lie Detector, which both 1) works and 2) is obviously useful.
To do a Side-Channel Attack on a system, you don't have to explain every detail of the system (or even know physics at all). You just have to find something in the system that is correlated with the thing you care about.
But certain details there are still somewhat sketchy, in particular we don't have a detailed understanding of the attention circuit, and replacing the query with "the projection onto the subspace we thought was all that mattered" harmed performance significantly (down to 30-40%).
@Neel Nanda FYI my first thought when reading that was "did you try adding random normal noise along the directions orthogonal to the subspace to match the typical variance along those directions?". Mentioning in case that's a different kind of thing than you'd already thought of.
Hopefully, even if we didn't get all the way there, this dialogue can still be useful in advancing thinking about mech interp.
I hope you guys repeat this dialogue again, as I think these kinds of drilled-down conversations will improve the community's ideas on how to do and teach mechanistic interpretability.
I'm a new OpenPhil fellow for a mid-career transition -- from other spaces in AI/ML -- into AI safety, with an interest in interpretability. Given my experience, I lean intuitively optimistic about mechanistic interpretability in the sense of discovering representations and circuits and trying to make sense of them. But I've started my deep dive into the literature. I'd be really interested to hear from @Buck and @ryan_greenblatt and those who share their skepticism about which directions they prefer to invest their own and their team's research efforts in!
What I got out of the convo and the comments was to rely more on probes rather than on dictionaries and circuits alone. But I feel pretty certain that's not the complete picture! I came to this convo from the Causal Scrubbing thread, which felt exciting to me and like a potential source of inspiration for a mini research project for my fellowship (6 months, including ramp up/learning). I was a bit bummed to learn that the authors found the main benefit of that project to be informing them to abandon mech interp :-D
On a related note, one of the other papers that put me on a path to this thread was this one on Causal Mediation. Fairly long ago at this point I had a phase of interest in Pearl's causal theory and thought that paper was a nice example of thinking about what's essentially ablation and activation patching from that point of view. Are there any folks who've taken a deeper stab at leveraging some of the more recent theoretical advances in graphical causal theory to do mech interp? Would super appreciate any pointers!
other model internals techniques
What are these? I'm confused about the boundary between mechinterp and others.
By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)
I think like 99% reliability is about the right threshold for large models based on my napkin math.
Serious question: we have 100% of the information, so why can't we get 100%?
Suggestion: Why not test if mechanistic interp can detect lies, for out of distribution data 99% of the time? (It should also generalise to larger models)
It's a useful and well studied benchmark. And while we haven't decided on a test suite, [there is some useful code](https://github.com/EleutherAI/elk).
This is referring to 99% in the context of "amount of loss that you explain in a human-interpretable way for some component in the model" (a notion of faithfulness). For downstream tasks, either much higher or much lower reliability could be the right target (depending on the exact task).
Other approaches to alignment are just as deserving of skepticism as mechanistic interpretability, if faced with as much scrutiny.
My greatest hopes for mechanistic interpretability do not seem represented, so allow me to present my pet direction.
You invest many resources in mechanistically understanding ONE teacher network, within a teacher-student training paradigm. This is valuable because now instead of presenting a simplistic proxy training signal, you can send an abstract signal with some understanding of the world. Such a signal is harder to "cheat" and "hack".
If we can fully interpret and design that teacher network, then our training signals can incorporate much of our world model and morality. True, this requires us to become philosophical and actually consider what such a world model and morality is... but at least in this case we have a technical direction. In such an instance, a good deal of the technical aspects of the alignment problem are solved (at least in aligning AI to humans, not humans to humans).
This argument says all mechanistic interpretability effort could be focused on ONE network. I concede this method requires the teacher to have a decent generalizable world model... At which point, perhaps we are already in the danger zone.
Could you say more? Why would a teacher network be more capable of training a student network than literal humans? By what mechanism do you expect this teacher network to train other networks in a way that benefits from us understanding its internals?
Teacher-student training paradigms are not too uncommon. Essentially the teacher network is "better" than a human because you can generate far more feedback data and it can react at the same speed as the larger student network. Humans also can be inconsistent, etc.
What I was discussing is that currently with many systems (especially RL systems) we provide a simple feedback signal that is machine interpretable. For example, the "eggs" should be at coordinates x, y. But in reality, we don't want the eggs at coordinates x, y we just want to make an omelet.
So, if we had a sufficiently complex teacher network it could understand what we want in human terms, and it could provide all the training signal we need to teach other student networks. In this situation, we may be able to get away with only ever fully mechanistically understanding the teacher network. If we know it is aligned, it can keep up and provide a sufficiently complex feedback signal to train any future students and make them aligned.
If this teacher network has a model of reality that models our morality and the complexity of the world then we don't fall into the trap of having AI doing stupid things like killing everyone in the world to cure cancer. The teacher network's feedback is sufficiently complex that it would never allow such actions to provide value in simulations, etc.
ryan_greenblatt – By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part.
The low-level components of a gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if it's neatly organized and thoroughly catalogued? How do you get from there to the overall design of the cathedral? How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”
Somewhere in “How the Mind Works” Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across an odd little metal contraption. It doesn’t make any sense at all. The shopkeeper sees your bewilderment and offers, “That’s an olive pitter.” Now that contraption makes sense. You know what it’s supposed to do.
How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?