All of Christopher Olah's Comments + Replies

We certainly think that abrupt changes in safety properties are very possible! See the discussion in this post of how the most pessimistic scenarios may look optimistic until very powerful systems are created, and also our paper on Predictability and Surprise.

With that said, I think we tend to expect a bit of continuity. Empirically, even the "abrupt changes" we observe with respect to model size tend to take place over order-of-magnitude changes in compute. (There are examples of things like the formation of induction heads where qualitative changes in model ... (read more)

2Roman Leventov
In a larger picture, you should also factor in the probability that the oversight (over the breeding of misaligned tendencies) will remain vigilant indefinitely. The entire history of safety science tells us that this is unlikely, or downright impossible. Mistakes will happen, "obligatory checks" do get skipped, and entirely unanticipated failure modes do emerge. And we would need to convince ourselves that none of this will happen, with decent probability, from the moment of the first AGI deployment until "the end of time" (or, practically, we should show theoretically that the ensuing recursive self-improvement or quasi-self-improvement sociotechnical dynamics will only converge to more resilience rather than less). This is very hard to demonstrate, but it must be done to justify AGI deployment. I didn't see evidence in the post that Anthropic appreciates this angle on the problem enough.

Unfortunately, I don't think a detailed discussion of what we regard as safe to publish would be responsible, but I can share how we operate at a procedural level. We don't consider any research area to be blanket safe to publish. Instead, we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.

6Gabe M
Thanks for the response, Chris, that makes sense and I'm glad to read you have a formal infohazard procedure!

how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?

I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.

(Obviously, within the company, there's a wide range of views. Some pe... (read more)

aysja
289

The weird thing about a portfolio approach is that the things it makes sense to work on in “optimistic scenarios” often trade off against those you’d want to work on in more “pessimistic scenarios,” and I don't feel like this is really addressed.

Like, if we’re living in an optimistic world where it’s pretty chill to scale up quickly, and things like deception are either pretty obvious or not all that consequential, and alignment is close to default, then sure, pushing frontier models is fine. But if we’re in a world where the problem is nearly impossible, ... (read more)

4TekhneMakre
There's a unilateralism problem with doing risky stuff in the name of picking up probability mass, where one has Knightian uncertainty about whether that probability mass is actually there. If people end up with a reasonable distribution over alignment difficulty, plus some noise, then the people whose noise happened to make their distribution more optimistic will view it as more worthwhile to trade off accelerated timelines for alignment success in medium-difficulty worlds. Mostly people should just act on their inside-view models, but it's pretty concerning to have another major org trying to have cutting-edge capabilities. The capabilities are going to leak out one way or another and are going to contribute to races.
-2Noah L.
What are the strategic reasons for prioritizing work on intermediate difficulty problems and "easy safety techniques" at this time?

I moderately disagree with this? I think most induction heads are at least primarily induction heads (and this points strongly at the underlying attentional features and circuits), although there may be some superposition going on. (I also think that the evidence you're providing is mostly orthogonal to this argument.)

I think if you're uncomfortable with induction heads, previous token heads (especially in larger models) are an even more crisp example of an attentional feature which appears, at least on casual inspection, to typically be monosemantic... (read more)
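For readers who want to poke at this themselves, here is a rough sketch (my addition, not part of the original exchange) of the common heuristic for measuring how much a head behaves as an induction head or a previous-token head on repeated random tokens. The `head_scores` helper and its assumed tensor shapes are made up for illustration.

```python
import torch

# Rough sketch (not from the original discussion) of the standard heuristic for
# quantifying how "induction-like" or "previous-token-like" a head is: run the
# model on a sequence made of two copies of a random block of length `period`,
# grab the attention pattern `attn` with shape [n_heads, seq, seq], and measure
# how much attention goes to the relevant positions.
def head_scores(attn: torch.Tensor, period: int):
    n_heads, seq, _ = attn.shape
    pos = torch.arange(seq)

    # Previous-token score: average attention from position i back to position i-1.
    prev_score = attn[:, pos[1:], pos[1:] - 1].mean(dim=1)

    # Induction score: in the second copy, average attention from position i to the
    # token that followed the previous occurrence of the current token (i - period + 1).
    rep = pos[pos >= period]
    ind_score = attn[:, rep, rep - period + 1].mean(dim=1)

    return prev_score, ind_score

# Example with a dummy uniform attention pattern (a real pattern would come from a model).
dummy_attn = torch.full((8, 40, 40), 1 / 40)
print(head_scores(dummy_attn, period=20))
```

Heads that score highly on these metrics across many random sequences are the ones usually labeled induction or previous-token heads; the disagreement above is about how much of a head's remaining behavior is something else.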

3Buck
Upon further consideration, I think you're probably right that the causal scrubbing results I pointed at aren't actually about the question we were talking about, my mistake. Seems like probably the optimal strategy. Thanks again for your thoughts here.

Can I summarize your concerns as something like "I'm not sure that looking into the behavior of "real" models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it's slightly better, but not considerably?

Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it.

Real language models seem to make extensive use of superposition... (read more)

Regarding the more general question of "how much should interpretability make reference to the data distribution?", here are a few thoughts:

Firstly, I think we should obviously make use of the data distribution to some extent (and much of my work has done so!). If you're trying to reverse engineer a regular computer program, it's extremely useful to have traces of that program running. So too with neural networks!

However, the fundamental thing I care about is understanding whether models will be safe off-distribution, so an understanding which is tied to a... (read more)

9Buck
I’m sympathetic to many of your concerns here. It seems to me like the induction head mechanism as described in A Mathematical Framework is an example of just looking at what a part of a model does on a particular distribution, given that those heads also do some unspecified amount of non-induction behaviors with non-induction mechanisms, as eg discussed here https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai . (Though there’s a big quantitative difference—the distribution where induction happens is way bigger than eg the distribution where IOI happens.) Do you agree?
2ryan_greenblatt
(Context: I work at Redwood.) Can I summarize your concerns as something like "I'm not sure that looking into the behavior of 'real' models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it's slightly better, but not considerably? If so, I mostly agree --- it doesn't seem very clear this is much better. I'm something like into:

* Picking a distribution
* Training a model to perform well on that distribution
* Interpreting the model (or parts of the model, etc.)

as a default interpretability workflow. For instance, it's not very clear to me that IOI is much more interesting than just training a model on some version of the IOI distribution and then interpreting that model. And I think a key problem with IOI is that the model doesn't really care very much about doing well on this exact task: after having skimmed through copious amounts[1] of OpenWebText, the IOI task as exactly formulated seems pretty non-central IMO.

There are various arguments for looking into narrow examples IMO, but the case is a bit more subtle. (For instance, it seems like we should ideally be able to answer questions like 'why did this model have strange behavior on this narrow distribution', where the 'why' will probably have to make reference to how the model behaves on a broader distribution of interest.)

It's also possible we disagree about how useful it is to do interpretability on toy tasks. I'm not really sure if there's anything interesting and quick to say here.

1. I've perhaps skimmed somewhere between 10,000 and 100,000 passages? (I haven't counted) ↩︎

Thanks for writing this up. It seems like a valuable contribution to our understanding of one-layer transformers. I particularly like your toy example – it's a good demonstration of how more complicated behavior can occur here.

For what it's worth, I understand this behavior as competition between skip-trigrams. We introduce "skip-trigrams" as a way to think of pairs of entries in the OV and QK-circuit matrices. The QK-circuit describes how much the attention head wants to attend to a given token in the attention softmax and implement a particular skip... (read more)
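To make the skip-trigram picture concrete, here is a minimal numpy sketch (my addition; the shapes and random weights are made up) of the QK and OV circuit matrices for a one-layer, attention-only transformer, following the framing in A Mathematical Framework for Transformer Circuits: the QK circuit scores (destination, source) token pairs, the OV circuit maps an attended source token to output logits, and the softmax over QK scores is where skip-trigrams compete.

```python
import numpy as np

# Illustrative shapes for a tiny one-layer, attention-only transformer.
d_vocab, d_model, d_head = 50, 16, 4
rng = np.random.default_rng(0)

W_E = rng.normal(size=(d_model, d_vocab))   # embedding
W_U = rng.normal(size=(d_vocab, d_model))   # unembedding
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# QK circuit: how much each destination token "wants" to attend to each source token.
QK = W_E.T @ W_Q.T @ W_K @ W_E          # [d_vocab, d_vocab]

# OV circuit: what attending to a source token contributes to the output logits.
OV = W_U @ W_O @ W_V @ W_E              # [d_vocab, d_vocab]

# A skip-trigram "src ... dst -> out" is governed by the pair (QK[dst, src], OV[out, src]).
src, dst, out = 3, 7, 11
qk_score, ov_effect = QK[dst, src], OV[out, src]

# Competition: attention is a softmax over QK scores for the destination token,
# so multiple candidate source tokens (i.e. multiple skip-trigrams) fight for weight.
def attention_pattern(dst_token, src_tokens):
    scores = np.array([QK[dst_token, s] for s in src_tokens])
    scores -= scores.max()
    return np.exp(scores) / np.exp(scores).sum()

print(attention_pattern(7, [3, 5, 9]))
```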

I'm curious how you'd define memorisation? To me, I'd actually count this as the model learning features ...

Qualitatively, when I discuss "memorization" in language models, I'm primarily referring to the phenomenon of language models producing long quotes verbatim if primed with a certain start. I mean it as a more neutral term than overfitting.

Mechanistically, the simplest version I imagine is a feature which activates when the preceding N tokens match a particular pattern, and predicts a specific N+1 token. Such a feature is analogous to the "single ... (read more)
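As a caricature of that "single datapoint feature" picture, here is a tiny illustrative sketch (my addition; the names and numbers are hypothetical, and a real model would implement anything like this in a distributed, superposed way):

```python
import numpy as np

# Minimal caricature of a "memorization feature": it activates only when the
# preceding N tokens exactly match a stored pattern, and then boosts the logit
# of one specific next token.
class MemorizationFeature:
    def __init__(self, pattern, next_token, boost=10.0):
        self.pattern = tuple(pattern)   # the memorized N-token prefix
        self.next_token = next_token    # the specific (N+1)-th token it predicts
        self.boost = boost

    def __call__(self, context, logits):
        n = len(self.pattern)
        if tuple(context[-n:]) == self.pattern:
            logits = logits.copy()
            logits[self.next_token] += self.boost
        return logits

# Usage: if the context ends with the memorized prefix, the continuation token
# gets a large logit bump, reproducing the quote verbatim.
feature = MemorizationFeature(pattern=[12, 7, 99], next_token=42)
logits = feature(context=[3, 12, 7, 99], logits=np.zeros(50))
print(logits.argmax())  # -> 42
```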

In this toy model, is it really the case that the datapoint feature solutions are "more memorizing, less generalizing" than the axis-aligned feature solutions? I don't feel totally convinced of this.

Well, empirically in this setup, (1) does generalize and get a lower test loss than (2).  In fact, it's the only version that does better than random. 🙂

But I think what you're maybe saying is that from the neural network's perspective, (2) is a very reasonable hypothesis when T < N, regardless of what is true in this specific setup.  And you could... (read more)
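For anyone wanting to reproduce the flavor of that empirical comparison, here is a rough sketch (my addition, with made-up hyperparameters) assuming a tied-weight toy model of the form y = ReLU(WᵀWx + b), trained on T points drawn from N sparse ground-truth features and evaluated on fresh samples against a trivial predict-zero baseline:

```python
import torch

# Sketch of the train/test comparison being described; hyperparameters are invented.
torch.manual_seed(0)
N, m, T, sparsity = 100, 20, 30, 0.05

def sample(n):
    # Sparse nonnegative features: each of the N features is active with prob `sparsity`.
    return torch.rand(n, N) * (torch.rand(n, N) < sparsity)

train_x, test_x = sample(T), sample(1000)

W = torch.nn.Parameter(torch.randn(m, N) * 0.1)
b = torch.nn.Parameter(torch.zeros(N))
opt = torch.optim.Adam([W, b], lr=1e-2)

def model(x):
    return torch.relu(x @ W.T @ W + b)

for step in range(5000):
    opt.zero_grad()
    loss = ((model(train_x) - train_x) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    test_loss = ((model(test_x) - test_x) ** 2).mean()
    baseline = (test_x ** 2).mean()   # loss of always predicting zero

# Depending on T, the learned solution may look more like per-datapoint memorization
# or more like the generalizing axis-aligned feature solution.
print(float(test_loss), float(baseline))
```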

I feel pretty confused, but my overall view is that many of the routes I currently feel are most promising don't require solving superposition.

 

It seems quite plausible there might be ways to solve mechanistic interpretability which frame things differently. However, I presently expect that they'll need to do something which is equivalent to solving superposition, even if they don't solve it explicitly. (I don't fully understand your perspective, so it's possible I'm misunderstanding something though!)

To give a concrete example (although this is easie... (read more)

1ryan_greenblatt
Overall, my view is that we will need to solve the optimization problem of 'what properties of the activation distribution are sufficient to explain how the model behaves', but this solution can be represented somewhat implicitly, and I don't currently see how you'd transition it into a solution to superposition in the sense I think you mean. I'll try to explain why I have this view, but it seems likely I'll fail (at least partially because of my own confusions).

Quickly, some background so we're hopefully on the same page (or at least closer): I'm imagining the setting described here. Note that anomalies are detected with respect to a distribution D (for a new datapoint x∗). So, we need a distribution where we're happy with the reason why the model works. This setting is restrictive in various ways (e.g., see here), but I think that practical and robust solutions would be a large advancement regardless (extra points for an approach which fails to have on-paper counterexamples).

Now the approaches to anomaly detection I typically think about work roughly like this:

* Try to find an 'explanation'/'hypothesis' for the variance on D which doesn't also 'explain' deviation from the mean on x∗. (We're worst-casing over explanations.)
* If we succeed, then x∗ is anomalous. Otherwise, it's considered non-anomalous.

Note that I'm using scare quotes around explanation/hypothesis - I'm referring to an object which matches some of the intuitive properties of explanations and/or hypotheses, but it's not clear exactly which properties we will and won't need.

This stated approach is very inefficient (it requires learning an explanation for each new datum x∗!), but various optimizations are plausible (e.g., having a minimal base explanation for D which we can quickly finetune for each datum x∗). I'm typically thinking about anomaly detection schemes which use approaches similar to causal scrubbing, though Paul, Mark, and other people at ARC typically think about heuristic arg…

Sorry for not noticing this earlier. I'm not very active on LessWrong/AF. In case it's still helpful, a couple of thoughts...

Firstly, I think people shouldn't take this graph too seriously! I made it for a talk in ~2017, I think, and even then it was intended as a vague intuition, not something I was confident in. I do occasionally gesture at it as a possible intuition, but it's just a vague idea which may or may not be true.

I do think there's some kind of empirical trend where models in some cases become harder to understand and then easier. For ... (read more)

This is a good summary of our results, but just to try to express a bit more clearly why you might care...

I think there are presently two striking facts about overfitting and mechanistic interpretability:

(1) The successes of mechanistic interpretability have thus far tended to focus on circuits which seem to describe clean, generalizing algorithms, which one might think of as the "non-overfitting parts of neural networks". We don't really know what "overfitting" mechanistically is, and you could imagine a world where it's so fundamentally messy we jus... (read more)

5Neel Nanda
Interesting context, thanks for writing it up! I'm curious how you'd define memorisation? To me, I'd actually count this as the model learning features - a bunch of examples will contain the Bible verse as a substring, and so there's a non-trivial probability that any input contains it, so this is a genuine property of the data distribution. It feels analogous to the model learning bigrams or trigrams, which are basically memorising 2 or 3 token substrings - in some sense this is memorisation, but to me it's a genuine feature of the data distribution.

My best attempt to operationalise memorisation is that it's about ways that the training data differs from the training data distribution. If some string has infinitesimal probability of occurring in a randomly sampled data point, but occurs in a single training example, then that feels like memorisation. But if something occurs in a bunch of training examples, it probably occurs with non-trivial probability in any sample from the data distribution. Alternately, it's memorisation if it results in significantly better loss on the training set than the test set (assuming they're from the same data distribution). I might be relying too hard on the notion of a data distribution though.

This also feels like an interesting difference between continuous data like images and discrete data like language - language can often have identical substrings in a way that seems much weirder in images (patches that are identical down to the pixel level?), so it feels harder to disentangle memorisation from generalisation in language.
4nostalgebraist
Interesting stuff!

In this toy model, is it really the case that the datapoint feature solutions are "more memorizing, less generalizing" than the axis-aligned feature solutions? I don't feel totally convinced of this.

Two ways to look at the toy problem:

1. There are N sparse features, one per input and output channel.
2. There are T sparse features, one per data point, and each one is active only on its data point. The features are related to the input basis by some matrix M ∈ R^(N×T).

There are some details of the toy model that put (2) on a "different footing" from (1). Since the input and output use the same basis, if we make a change of basis, we have to change back again at the end. And because the weights are tied, these two operations have to be transposes, i.e. the change of basis has to be a rotation. As illustrated in the Colab, requiring the data to be orthonormal is sufficient for this. The experiment constrained the data to unit norm, and it's close to orthogonal with high probability for T ≪ N.

Now, it happens that (1) is the true data-generating process, but the model has no way of guessing that. In the finite-data case, the data may be consistent with multiple data-generating processes, and a solution that generalizes well with respect to one of them may generalize poorly with respect to another. To designate one data-generating process as the relevant one for generalization, we have to make a value judgment about which hypotheses are better, among those that explain the data equally well.

In particular, when T < N, hypothesis (2) seems more parsimonious than hypothesis (1): it explains the data just as well with fewer features! The features aren't axis-aligned like in (1), but features in real problems won't be axis-aligned either.

In some sense, it does feel like there's a suspicious lack of generalization in (2). Namely, that no generalization is made between the training examples: any knowledge you gain about a feature from seeing o…
4ryan_greenblatt
Thanks for the great comment clarifying your thinking!

I would be interested in seeing the data dimensionality curve for the validation set on MNIST (as opposed to just the train set) - it seems like the stated theory should make pretty clear predictions about what you'd see. (Or maybe I'll try to make reproduction happen and do some more experiments.)

I feel pretty confused, but my overall view is that many of the routes I currently feel are most promising don't require solving superposition. At least, they don't require solving superposition in the sense I think you mean. I'm not sure the rest of my comment here will make sense without more context, but I'll try.

Specifically, these routes require decoding superposition, but not obviously more so than training neural networks requires decoding superposition. Overall, the hope would be something like 'SGD is already finding weights which decode the representations the NN is using, so we can learn a mechanistic "hypothesis" for how these weights work via SGD itself'. I think this exact argument doesn't quite pan out (see here for related discussion) and it's not clear what we need as far as mechanistic 'hypotheses' go.
5LawrenceC
Thanks for the clarification! I'll have to think more about this. 

See also the Curve Detectors paper for a very narrow example of this (https://distill.pub/2020/circuits/curve-detectors/#dataset-analysis -- a straight line on a log prob plot indicates exponential tails).

I believe the phenomenon of neurons often having activation distributions with exponential tails was first informally observed by Brice Menard.
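A quick way to eyeball this on a neuron's activations (a sketch of my own, using synthetic data in place of real activations) is to plot a log-probability histogram: an exponential tail shows up as a straight line.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: an exponential tail appears as a straight line on a log-probability
# histogram. The "activations" here are synthetic, standing in for a neuron's
# activation values collected over a dataset.
rng = np.random.default_rng(0)
acts = rng.exponential(scale=1.0, size=100_000)   # stand-in for real activations

counts, edges = np.histogram(acts, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2
mask = counts > 0

plt.plot(centers[mask], np.log(counts[mask]))
plt.xlabel("activation")
plt.ylabel("log probability density")
plt.title("Exponential tail -> straight line on a log-prob plot")
plt.show()
```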

I just stumbled on this post and wanted to note that very closely related ideas are sometimes discussed in interpretability under the names "universality" or "convergent learning": https://distill.pub/2020/circuits/zoom-in/#claim-3

In fact, not only do the same features form across different neural networks, but we actually observe the same circuits as well (eg. https://distill.pub/2020/circuits/frequency-edges/#universality ).

1Nathan Helm-Burger
some more relevant discussion here on abstraction in ML: https://www.deepmind.com/publications/abstraction-for-deep-reinforcement-learning

Well, goodness, it's really impressive (and touching) that someone absorbed the content of our paper and made a video with thoughts building on it so quickly! It took me a lot longer to understand these ideas.

I'm trying to not work over the holidays, so I'll restrict myself to a few very quick remarks:

  • There's a bunch of stuff buried in the paper's appendix which you might find interesting, especially the "additional intuition" notes on MLP layers, convolutional-like structure, and bottleneck activations. A lot of it is quite closely related to the thin

... (read more)
1MadHatter
Thanks! Enjoy your holidays!

* Well, now I feel kind of dumb (for misremembering how LayerNorm works). I've actually spent the past day since making the video wondering why information leakage of the form you describe doesn't occur in most transformers, so it's honestly kind of a relief to realize this.
* It seems to me that ReLU is a reasonable approximation of GELU, even for networks that are actually using GELU. So one can think about GELU(x) = xΦ(x) as just having a slightly messy mask function (Φ(x)) that is sort-of-well-approximated by the ReLU binary mask function.
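A tiny sketch of that last point (my addition, purely illustrative): the exact GELU is x times the Gaussian CDF Φ(x), so it agrees closely with ReLU away from a small region around zero.

```python
import torch

# GELU(x) = x * Phi(x): ReLU's hard 0/1 mask replaced by the "soft" mask Phi(x).
x = torch.linspace(-4, 4, 9)
gelu = torch.nn.functional.gelu(x)               # exact (erf-based) GELU
mask = torch.distributions.Normal(0, 1).cdf(x)   # Phi(x), the soft mask
relu = torch.relu(x)
print(torch.stack([x, gelu, x * mask, relu], dim=0))
```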

Thanks for making that distinction, Steve. I think the reason things might sound muddled is that many people expect that (1) will drive (2).

Why might one expect (1) to cause (2)? One way to think about it is that, right now, most ML experiments optimistically give 1-2 bits of feedback to the researcher, in the form of whether their loss went up or down from a baseline. If we understand the resulting model, however, that could produce orders of magnitude more meaningful feedback about each experiment. As a concrete example, in InceptionV1, there are a

... (read more)

Evan, thank you for writing this up! I think this is a pretty accurate description of my present views, and I really appreciate you taking the time to capture and distill them. :)

I’ve signed up for AF and will check comments on this post occasionally. I think some other members of Clarity are planning to do so as well. So everyone should feel invited to ask us questions.

One thing I wanted to emphasize is that, to the extent these views seem intellectually novel to members of the alignment community, I think it’s more accurate to attribute the no... (read more)

9Ben Pace
Btw, we just pushed some basic subscription options. In the triple-dot menu for this post (at the top of the post), there's an option to 'subscribe to comments', and you'll get notified of new comments any time you go to the Alignment Forum, rather than having to check this page in particular. Edit: There was actually a bug in the notifications system on the AI Alignment Forum when I wrote this comment. It's fixed now.

Subscribed! Thanks for the handy feature.

One thing I'd add, in addition to Evan's comments, is that the present ML paradigm and Neural Architecture Search are formidable competitors. It feels like there’s a big gap in effectiveness, where we’d need to make lots of progress for “principled model design” to be competitive with them in a serious way. The gap causes me to believe that we’ll have (and already have had) significant returns on interpretability before we see capabilities acceleration. If it felt like interpretability was accelerating capabilities on the present margin, I’d be a bit more

... (read more)
3kvas
What if we think about it the following way? ML researchers range from _theorists_ (who try to produce theories that describe how ML/AI/intelligence works at the deep level and how to build it) to _experimenters_ (who put things together using some theory and lots of trial and error and try to make it perform well on the benchmarks). Most people will be somewhere in between on this spectrum, but people focusing on interpretability will be further towards theorists than most of the field.

Now let's say we boost the theorists and they produce a lot of explanations that make better sense of the state of the art that experimenters have been playing with. The immediate impact of this will be improved understanding of our best models, and this is good for safety. However, when the experimenters read these papers, their search space (of architectures, hyperparameters, training regimes, etc.) is reduced and they are now able to search more efficiently. Standing on the shoulders of the new theories they produce even better performing models (however, they still incorporate a lot of trial and error because this is what experimenters do).

So what we achieved is better understanding of the current state-of-the-art models combined with new improved state of the art that we still don't quite understand. It's not immediately clear whether we're better off this way. Or is this model too coarse to see what's going on?

I think that’s a fair characterization of my optimism.

I think the classic response to me is “Sure, you’re making progress on understanding vision models, but models with X are different and your approach won’t work!” Some common values of X are not having visual features, recurrence, RL, planning, really large size, and language-based. I think that this is a pretty reasonable concern (more so for some Xs than others). Certainly, one can imagine worlds where this line of work hits a wall and ends up not helping with more powerful systems. However, I would

... (read more)

I'm curious what's Chris's best guess (or anyone else's) about where to place AlphaGo Zero on that diagram

Without the ability to poke around at AlphaGo -- and a lot of time to invest in doing so -- I can only engage in wild speculation. It seems like it must have abstractions that human Go players don’t have or anticipate. This is true of even vanilla vision models before you invest lots of time in understanding them (I've learned more than I ever needed to about useful features for distinguishing dog species from ImageNet models).

But I’d hope the abstr

... (read more)
4ESRogs
Is there anything that prevents them from being used as microscopes though? Presumably you can still inspect the models it has learned without using it as an agent (after it's been trained). Or am I missing something?