(This is a follow-up to Anthropic's prior work on Toy Models of Superposition.)
The authors study how neural networks interpolate between memorization and generalization in the "ReLU Output" toy model from the first toy model paper:

They train models to perform a synthetic regression task with varying numbers of training points, for models of varying hidden dimension.
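For concreteness, here is a minimal sketch of my reading of the setup (the model is the ReLU output model from the toy models paper, but every size, sparsity level, and training detail below is my assumption rather than the paper's exact configuration):

```python
import torch

# My reconstruction of the "ReLU output" setup -- sizes and hyperparameters are assumptions.
n_features, d_hidden, n_train = 1_000, 2, 6   # many sparse features, tiny hidden dim, few datapoints
sparsity = 0.999                              # each feature is zero with high probability

# Fixed synthetic training set of sparse, non-negative feature vectors.
X = torch.rand(n_train, n_features) * (torch.rand(n_train, n_features) > sparsity).float()

W = torch.nn.Parameter(0.01 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5_000):
    h = X @ W                          # hidden vectors (the "training set hidden vectors" below)
    x_hat = torch.relu(h @ W.T + b)    # the ReLU output model reconstructs its own input
    loss = ((x_hat - X) ** 2).mean()   # the paper also weights features by an importance factor
    opt.zero_grad()
    loss.backward()
    opt.step()
```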
First, they find that for small training sets, while the features are messy, the training set hidden vectors (the projection of the input datapoints into the hidden space) often show clean structures:

They then extend their earlier definition of feature dimensionality to measure the dimensionality allocated to each of the training examples, and plot this against the dataset size (and also the test loss):
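For reference, my rough paraphrase of the two definitions (treat the exact notation here as mine, not necessarily the paper's):

```latex
% Feature dimensionality (from Toy Models of Superposition), where W_i is the
% embedding of feature i and \hat{W}_i = W_i / \|W_i\| is its unit vector:
D_i = \frac{\|W_i\|^2}{\sum_j (\hat{W}_i \cdot W_j)^2}

% The per-datapoint analogue, where h^x is the hidden vector of training
% example x and \hat{h}^x its unit vector:
D^x = \frac{\|h^x\|^2}{\sum_{x'} (\hat{h}^x \cdot h^{x'})^2}
```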

This shows that as you increase the amount of data, you go from a regime where high dimensionality is allocated to the training-set hidden vectors and low dimensionality to the features, to one where the opposite is true. In between the two, both features and hidden vectors are allocated low dimensionality, which coincides with a spike in test loss; they compare this to the phenomenon of "data double descent" (where, as you increase the amount of data for an overparameterized model with little regularization, test loss can go up before it goes down).
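In code, the per-example quantity is only a few lines to compute; a sketch, where H stands for the matrix of training-set hidden vectors (e.g. the h from the sketch above):

```python
import torch

def datapoint_dimensionality(H: torch.Tensor) -> torch.Tensor:
    """Dimensionality allocated to each training example.

    H: (n_train, d_hidden) matrix of hidden vectors.
    Returns ||h_x||^2 / sum_{x'} (h_x_hat . h_{x'})^2 for each x, a value in [0, 1].
    """
    H_hat = H / H.norm(dim=-1, keepdim=True)   # unit hidden vectors
    overlaps = (H_hat @ H.T) ** 2              # squared projections onto every other hidden vector
    return H.norm(dim=-1) ** 2 / overlaps.sum(dim=-1)
```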
Finally, they visualize how varying the dataset size and the hidden dimension affects test loss, and find double descent along both axes:
They also included some (imo very interesting) experiments from Adam Jermyn: 1) replicating the results, 2) exploring how weight decay interacts with this double descent-like phenomenon, and 3) studying what happens if you repeat particular datapoints.
Some limitations of the work, based on my first read through:
- The authors note that the results seem quite sensitive to hyperparameters, especially at low hidden dimension. For example, Adam Jermyn's results differ from the Anthropic interpretability team's results (though the figures still look qualitatively similar).
- I'm still not super convinced how much the results from the superposition work apply in practice. I'd be interested in seeing more work along the lines of the preliminary MNIST experiment by Chris Olah at the bottom.
(I'll probably have more thoughts as I think for longer.)
This is a good summary of our results, but just to try to express a bit more clearly why you might care...
I think there are presently two striking facts about overfitting and mechanistic interpretability:
(1) The successes of mechanistic interpretability have thus far tended to focus on circuits which seem to describe clean, generalizing algorithms, which one might think of as the "non-overfitting parts of neural networks". We don't really know what "overfitting" mechanistically is, and you could imagine a world where it's so fundamentally messy that we just can't understand it!
(2) There's evidence that more overfit neural networks are harder to understand.
A pessimistic interpretation of this could be something like: Overfitting is fundamentally a messy kind of computation we won't ever cleanly understand. We're dealing with pathological models/circuits, and if we want to understand neural networks, we need to create non-overfit models.
In the case of vision, that might seem kind of sad but not horrible: you could imagine creating larger and larger datasets that reduce overfitting. ImageNet models are more interpretable than MNIST ones and perhaps that's why. But language models seem like they morally should memorize some data points. Language models should recite the US constitution and Shakespeare and the Bible. So we'd really like to be able to understand what's going on.
The naive mechanistic hypothesis for memorization/overfitting is to create features, represented by neurons, which correspond to particular data points. But there are a number of problems with this, most obviously that you would very quickly need far more neurons than the model has in order to give each data point its own.
The obvious response to that is "perhaps it's occurring in superposition."
So how does this relate to our paper?
Firstly, we have an example of overfitting -- in a problem which wasn't specifically tuned for overfitting / memorization -- which from a naive perspective looks horribly messy and complicated but turns out to be very simple and clean. Although it's a toy problem, that's very promising!
Secondly, what we observe is exactly the naive hypothesis + superposition. And in retrospect this makes a lot of sense! Memorization is the ideal case for something like superposition. Definitionally, a single data point feature is the most sparse possible feature you can have.
Thirdly, Adam Jermyn's extension to repeated data shows that "single data point features" and "generalizing features" can co-occur.
The nice double descent phase change is really just the cherry on the cake. The important thing is having these two regimes where we represent data points vs features.
There's one other reason you might care about this: it potentially has bearing on mechanistic anomaly detection.
Perhaps the clearest example of this is Adam Jermyn's follow up with repeated data. Here, we have a model with both "normal mechanisms" and "hard coded special cases which rarely activate". And distinguishing them would be very hard if one didn't understand the superposition structure!
Our experiment with extending this to MNIST, although obviously also very much a toy problem, might be interpreted as detecting "memorized training data points" which the model does not use its normal generalizing machinery for, but instead has hard coded special cases. This is a kind of mechanistic anomaly detection, albeit within the training set. (But I kind of think that alarming machinery must form somewhere on the training set.)
One nice thing about these examples is that they start to give a concrete picture of what mechanistic anomaly detection might look like. Of course, I don't mean to suggest that all anomalies would look like this. But as someone who really values concrete examples, I find this useful in my thinking.
These results also suggest that if superposition is widespread, mechanistic anomaly detection will require solving superposition. My present guess (although very uncertain) is that superposition is the hardest problem in mechanistic interpretability. So this makes me think that anomaly detection likely isn't a significantly easier problem than mechanistic interpretability as a whole.
All of these thoughts are very uncertain of course.
Super interesting, thanks! I hadn't come across that work before, and that's a cute and elegant definition.
To me, it seems natural to extend this to specific substrings in the document? I believe that models are trained with documents chopped up and concatenated into segments that fully fill the context window, so it feels odd to treat the document as the unit of analysis. And in some sense a 1000-token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised.
Maybe we should just not apply a gradient update to the tokens in the repeated substring, but keep the document in and measure the loss on the rest?
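Concretely, I'm imagining something like this (just a sketch of the idea, assuming a standard next-token cross-entropy setup; ignore_index is the usual way to drop positions from the loss):

```python
import torch
import torch.nn.functional as F

def loss_excluding_repeated_tokens(logits, tokens, repeated_mask):
    """Next-token loss that skips the tokens inside the repeated substring,
    while keeping the full document in context and training on the rest.

    logits: (batch, seq, vocab); tokens: (batch, seq);
    repeated_mask: (batch, seq) bool, True on tokens of the repeated substring.
    """
    targets = tokens[:, 1:].clone()        # standard next-token shift
    targets[repeated_mask[:, 1:]] = -100   # positions with target -100 contribute no loss or gradient
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

The repeated substring still appears in the context, so predictions elsewhere in the document can attend to it; it just contributes no gradient signal of its own.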