Some people worry about interpretability research being useful for AI capabilities and potentially net-negative. As far as I was aware, this worry had mostly been theoretical, but now there is a real-world example: the Hungry Hungry Hippos (H3) paper.

Tl;dr: The H3 paper

  • Proposes an architecture for sequence modeling which can handle larger context windows than transformers
  • Was inspired by interpretability work.

(Note that the H3 paper is from December 2022, and it was briefly mentioned in this discussion about publishing interpretability research. But I wasn’t aware of it until recently and I haven’t seen the paper discussed here on the forum.)

 I don't know why the paper has that name. One of the authors mentioned that it’s called “hungry hungry hippos” rather than “hungry hippos” because it uses two state space model layers rather than one. But I don’t think they mention why the hippo is hungry, or why it’s a hippo.

Larger Context Windows

The H3 paper proposes a way to use state space models (SSMs) for language models instead of attention. With an SSM it’s possible to model longer sequences: using attention, the compute for a context window of length $N$ scales with $N^2$, while using the SSM-based architecture it scales with $N \log N$.
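To make the scaling difference concrete, here is a minimal toy sketch in PyTorch (an illustration, not code from the paper): attention materializes an $N \times N$ score matrix, while an SSM layer can be applied as a long convolution over the sequence, which the FFT computes in roughly $N \log N$ time.

```python
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Attention builds an (N, N) score matrix, so compute and memory
    # scale quadratically in the sequence length N.
    return torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)

def ssm_as_long_convolution(u: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # An SSM layer can be applied as a causal convolution of the length-N input
    # with a length-N kernel; doing it via the FFT costs O(N log N) rather than O(N^2).
    n = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * n)
    k_f = torch.fft.rfft(kernel, n=2 * n)
    return torch.fft.irfft(u_f * k_f, n=2 * n)[..., :n]
```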

Inspired by Interpretability Work

The paper mentions that the work was inspired by Anthropic’s In-context learning and induction heads paper. 

E.g., they write:

We provide an informal sketch of a two-layer attention model that can solve the associative recall task, inspired by the construction of [In-context learning and induction heads paper].

There is also the "Hyena paper" which builds on the H3 paper, and was also inspired by the induction heads paper:

This work would not have been possible without [...] inspiring research on mechanistic understanding of Transformers (Olsson et al. 2022; Power et al. 2022; Nanda et al. 2023).

My Takes

  • These two papers in particular will probably not shorten AI timelines much.
    • It seems unlikely that this type of architecture ends up being the state of the art.
  • However, the example makes me take the downsides of publishing interpretability research more seriously.
    • Even if this work itself is not a key capability milestone, it shows that there is truth in the argument “If we understand systems better, it will not just be useful for safety but also lead to capability advancements”
  • Capabilities externalities are a strong argument that most (good) interpretability research should not be published
    • There are alternative ways to distribute research which are less risky than publishing.
      • We can probably learn something by studying military research practices which have a similar use case of “make research accessible to other researchers while preventing it from becoming public”
      • The constraints are less strict than with military research because there is not an adversary force trying really hard to get access.
    • Maybe this is already relatively common (I would not know of most unpublished research)
  • On the other hand, interpretability research is probably crucial for AI alignment. 
    • I think it is possible but unlikely that we get alignment without extremely good interpretability.
    • The cost of keeping interpretability research private is really high. Publishing is a great driver of scientific progress.
  • Overall, publishing interpretability research seems both pretty risky and extremely valuable, and it’s not clear to me whether it is worth it.

Your Takes?

I would be really interested to see a discussion about this! 

  • How big a deal are the H3 and Hyena papers?
  • Does this example change your mind about whether publishing (or even doing) interpretability research is a good idea?


 

Comments

There have been other historical cases where authors credit prior interpretability work for capability advances; interpretability is not something that only the AIS people have done. But as far as I know, none of these claimed advances have contributed to state-of-the-art models, especially not any that have persisted with scaling. (The Bitter Lesson applies to almost all attempts to build additional structure into neural networks, it turns out.)

That's not to say that it's correct to publish everything! After all, given that so few capability advances stick, we both get very little signal on each case AND the impact of a single interp-inspired capability advance would be potentially very large. But I don't think the H3 paper should be much of an update in either direction (beyond the fact that papers like H3 exist, and have existed in the past).

 

 

As an aside: The H3 paper was one of the reasons why the linked "Should We Publish Mech Interp" post was written -- IIRC AIS people on Twitter were concerned about H3 as a capabilities advance resulting from interp, which sparked the discussion I had with Marius.

The Bitter Lesson applies to almost all attempts to build additional structure into neural networks, it turns out.

 

Out of curiosity, what are the other exceptions to this besides the obvious one of attention?

Off the top of my head: residual (skip) connections, improved ways of doing positional embeddings/encodings, and layer norm. 
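(A minimal sketch of how two of these typically appear together, assuming a standard pre-norm block; the class name here is just illustrative:)

```python
import torch.nn as nn

class PreNormResidualBlock(nn.Module):
    # Layer norm followed by an arbitrary sublayer (e.g. attention or an MLP),
    # wrapped in a residual (skip) connection.
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```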

Interpretability seems pretty useful for alignment, but it also seems pretty dangerous for capabilities. Overall the field seems net-bad. Using an oversimplified model, my general reason for thinking this is that any given interpretability advance can be used either for capabilities or for alignment. Alignment is both harder than improving model capabilities and has fewer people working on it. Even if the marginal interpretability advance would be net good for alignment if alignment and capabilities were similar in size and difficulty, we should still expect that it will get used for capabilities.

Lots of people like pointing to how better interpretability almost never makes long-term improvements to model capabilities, but it leads to just as few improvements to model alignment! And the number & quality of papers or posts using interpretability methods for capabilities vastly exceeds the number & quality using interpretability methods for alignment.

The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work (notably work that is not so useful if Nate & Eliezer are right about AGI coherence). In contrast, it seems like all the papers using interpretability to advance capabilities rely on Anthropic's transformer circuits work.

These are two interesting case-studies, and more work should probably be done comparing their relative merits. But in lieu of that, my explanation for the difference in outcomes is this:

Anthropic's work was highly explorational, while Team Shard's was highly targeted. Anthropic tried to understand the transformer architecture and training process in general, while shard theory tried to understand values and only values. If capabilities is easier than alignment, it should not be too surprising if an unfocused approach makes capabilities relatively easier, while a focused-on-values approach makes alignment relatively easier. The unfocused approach will gather a wide range of low-hanging fruit, but little low-hanging fruit is alignment related, so most fruit gathered will be capabilities related.

This is why I'm pessimistic about most interpretability work. It just isn't focused enough! And it's why I'm optimistic about interpretability (and interpretability-adjacent) work focused on explicitly understanding the value systems of our ML systems, and how those can be influenced.

So a recommendation for those hoping to work on interpretability and have it be net-positive: Focus on understanding the values of models! Or at least other directly alignment relevant parts of models.

For example, I mostly expect a solution to superposition to be net-negative, in the same way that transformer circuits is net-negative, though at the same time I also expect a superposition solution to have lots of alignment benefits in the short term. If AGI is further off, a superposition solution ends up being net-negative; the closer AGI is to now, the more positive a superposition solution becomes.

Another sort of interpretability advance I'm worried about: locating the optimization algorithms operating inside neural networks. I admit these would give large alignment boosts, but that seems inconsequential compared to their potential for large boosts to capabilities. Such advances may be necessary for alignment though, so I'm happier in a world where these are not so widely publicized, and given only to the superintelligence alignment wings of AGI labs [EDIT: and a group of researchers outside the labs, all in a way such that nobody shares it with people who may use the knowledge to advance capabilities].

LawrenceC:

This is why I'm pessimistic about most interpretability work. It just isn't focused enough

Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or that their theory of change is either doomed or useless.

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?

I agree that if your theory of change for interp goes through, "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways. 

The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work

There's a lot of interpretability work that performs act-add like ablations to confirm that their directions are real, and ITI is basically act adds but they compute act adds with many examples instead of just a pair. But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods. 
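(A minimal sketch of that distinction, assuming activations have already been collected on contrastive prompts; the function name is illustrative, not ITI's actual API:)

```python
import torch

def mean_difference_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # pos_acts / neg_acts: (num_examples, hidden_dim) activations from contrastive
    # prompts. With num_examples == 1 this reduces to a single act-add pair;
    # averaging over many examples is the ITI-style variant.
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
```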

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?

Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those.

At the same time I suspect that many of those dual use concerns can be mitigated by making your MI research targeted. Not necessarily made such that you can do off-the-shelf interventions based on your findings, but made such that if it ever has any use, that use is going to be for alignment, and you can predict broadly what that use will look like.

This also doesn't mean your MI research can't be ambitious. I don't want to criticize people for being ambitious or too theoretical! I want to criticize people for producing knowledge on something which, while powerful, seems powerful in too many directions to be useful if done publicly.

I agree that if your theory of change for interp goes through, "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways.

I don't entirely know what you mean by this. How would we solve alignment by not going through a concrete problem? Maybe you think MI will be secondary to that process, and will give us useful information about what problems are necessary to solve? In such a case I still don't see why you need ambitious MI. You can just test the different problem classes directly. Maybe you think the different problem classes are too large to test directly. Even in that case, I still think that a more targeted approach would be better, where you generate as much info about those target classes as possible, while minimizing info that can be used to make your models better. And you selectively report only the results of your investigation which bear on the problem class. Even if the research is exploratory, the result & verification demonstration can still be targeted.

But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.

Maybe I misspoke. I dislike current MI because I expect large capability improvements before and at the same time as the alignment improvements, but I don't dispute future alignment improvements, just whether they'll be worth it. The reason I brought that up was as some motivation for why I think targeted is better, and why I don't like some people's criticism of worries about MI externalities by appealing to the lack of capabilities advances caused by MI. There've certainly been more attempts at capabilities improvements motivated by MI than there have been attempts at alignment improvements. Regardless of what you think about the future of the field, it's interesting when people make MI discoveries which don't lead to too many capabilities advances.

I personally like activation additions because they give me evidence about how models mechanistically behave in a way which directly tells me about which threat models are more or less likely, and they have the potential to make auditing and iteration a lot easier. These are accomplishments which ambitious MI is nowhere close to, and which I expect its methods would have to pay a lot for in terms of capability advances in order to reach. I mention this as evidence for why I expect targeted approaches to be faster and cheaper than ambitious ones, at least if done publicly.

A couple of unconnected points:

Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those.

This doesn't clearly follow: one way for x to be easier is [there are many ways to do x, so that it's not too hard to find one]. If it's easy to find a few ways to get x, giving me another one may not help me at all. If it's hard to find any way to do x, giving me a workable approach may be hugely helpful.
(I'm not making a case one way or another on the main point - I don't know the real-world data on this, and it's also entirely possible that the bar on alignment is so high that most/all MI isn't useful for alignment)

 

I mention this as evidence for why I expect targeted approaches are faster and cheaper than ambitious ones

I'm not entirely clear I understand you here, but if I do, my response would be: targeted approaches may be faster and cheaper at solving the problems they target. Ambitious approaches are more likely to help solve problems that you didn't know existed, and didn't realize you needed to target.

If targeted approaches are being used for [demonstrate that problems of this kind are possible], I expect they are indeed faster and cheaper. If we're instead talking about being used as part of an alignment solution, targeted approaches seem likely to be ~irrelevant (of course I'd be happy if I'm wrong on this!).
(again, assuming I understand how you're using 'targeted' / 'ambitious')

a lot of interpretability work that performs act-add like ablations to confirm that their directions are real

Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?

ITI is basically act adds but they compute act adds with many examples instead of just a pair

Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).

Minor clarifying point: Act-adds cannot be cast as ablations.

Sorry, ablation might be the wrong word here (but people use it anyway): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now.
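(For concreteness, a minimal PyTorch sketch of such an intervention, assuming a layer whose output is a single tensor; the hook and scale are illustrative rather than any specific paper's implementation:)

```python
import torch
import torch.nn as nn

def make_direction_hook(direction: torch.Tensor, scale: float = 1.0):
    # Forward hook that shifts a module's output along `direction`.
    # A negative `scale` subtracts the direction instead of adding it.
    def hook(module, inputs, output):
        return output + scale * direction
    return hook

# Toy usage on a linear layer standing in for a real model's internal layer:
layer = nn.Linear(8, 8)
direction = torch.randn(8)
handle = layer.register_forward_hook(make_direction_hook(direction, scale=3.0))
shifted = layer(torch.randn(2, 8))  # outputs are moved along `direction`
handle.remove()
```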

Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper? 

I'd probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn't count as patching.

Sure, though it seems too general or common to use a long word for it?

Maybe "linear intervention"?

We called it vector arithmetic in my Othello paper?

Oh, I like that one! Going to use it from now on

Nice post. I think it can serve as a good example of how the hand-waviness about interpretability helping us do good things with AI goes both ways.

I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.

Lastly,

On the other hand, interpretability research is probably crucial for AI alignment. 

I don't think this is true and I especially hope it is not true because (1) mechanistic interpretability still fails to do impressive things by trying to reverse engineer networks and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.

I think we'd agree that existing mech interp stuff has not done particularly impressive safety-relevant things. But I think the argument goes both ways, and pessimism for (mech) interp should also apply for its ability to do capabilities-relevant things as well. 

I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
 

To use your argument, what does MI actually do here? It seems that you could just study the LLMs directly with either behavioral evals or non-mechanistic, top-down interpretability methods, and probably get results more easily. Or is it a generic "don't study/publish papers on situational awareness"?

On the other hand, interpretability research is probably crucial for AI alignment. 

As you say in your linked post, I think it's important to distinguish between mechanistic interp and broadly construed model-internals ("interpretability") research. That being said, my guess is that we'd agree that the broadly construed version of interpretability ("using non-input-output modalities of model interaction to better predict or describe the behavior of models") is clearly important, and also that mechanistic interp has not made a super strong case for its usefulness as of writing.

Thanks

To use your argument, what does MI actually do here?

The inspiration, I would suppose. Analogous to the type claimed in the H3 and Hyena papers.

 

And yes to your second point.

 I don't know why the paper is called that way. ... I think they don’t mention why the hippo is hungry. Or why it’s a hippo.

It's a kids' game.

Yep.

Specifically, it's named for the papers HiPPO: Recurrent Memory with Optimal Polynomial Projections and How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections, which kicked off the whole state-space-models-as-an-alternative-to-transformers thing.

"HiPPO" abbreviates higher-order polynomial projection operators.

I agree that interpretability research is risky, and that one should carefully consider whether it is worth it to perform this interpretability research. I propose that a safer alternative would be to develop machine learning models that are

  1. quite unrelated to the highest performing machine learning models (this is so that capability gains in these safer models do not translate very well to capability gains for the high performing models),
  2. as inherently interpretable as one can make these models,
  3. more limited in functionality than the highest performing machine learning models, but
  4. still functional enough to be useful in practice.

These characteristics are correlated with each other, so it is likely that one would get all of these characteristics at once. For example, one will most likely not get interpretability for free but instead trade interpretability for performance. Being unrelated to the highest performing machine learning models will also likely reduce the performance of the machine learning models.

It seems like spectral methods such as PageRank or the eigenvectors of the Laplacian of a graph satisfy some of these characteristics, but it would be good if there were more research into developing spectral methods further to increase their functionality.
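(As a toy illustration of the kind of spectral method meant here: a power-iteration PageRank on a dense adjacency matrix, assuming every node has at least one outgoing edge; purely illustrative:)

```python
import numpy as np

def pagerank(adjacency: np.ndarray, damping: float = 0.85, iters: int = 100) -> np.ndarray:
    # Power iteration: the result approximates the dominant eigenvector of the
    # damped transition ("Google") matrix, which is what makes the method spectral.
    n = adjacency.shape[0]
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-stochastic
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1.0 - damping) / n + damping * transition.T @ rank
    return rank

# Example: a small directed graph.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
print(pagerank(A))
```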

Perhaps interpretability research has yielded such few results because we need to first develop inherently interpretable models.

I certainly think that developing fundamentally more interpretable models from scratch is a wise path forward for humanity. I think you make some reasonable proposals for directions that could be pursued. There are quite a few researchers and groups working on a wide variety of directions for this sort of fundamentally more interpretable and controllable AI. For example: https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal 

The downside is that it's almost certainly a slower path to power. If you don't simultaneously slow down all the other, more direct, paths to raw AI power then the slow paths become irrelevant. Like building a very safe campfire in the woods right next to someone building a huge dangerous bonfire. So then you get into the issue of worldwide monitoring and enforcement of AI R&D, which is not an easy problem to tackle. Another way of thinking about this is saying that pursuing safer but less straightforwardly powerful approaches is paying an 'alignment tax'. https://www.lesswrong.com/tag/alignment-tax 

I am very much in favor of this approach by the way. I'm just really concerned about the feasibility and success likelihood of worldwide regulatory enforcement.

I am not expecting any worldwide regulation on AI that prohibits people from using or training unaligned systems (I am just expecting a usual level of regulation). I am mainly hoping for spectral techniques to develop to the point where AI groups will want to use these spectral techniques (or some other method) more and more until they are competitive with neural networks at general tasks or at least complement the deficiencies of neural networks. I also hope that these spectral techniques will remain interpretable and aligned.

Right now, there are several kinds of tasks in which I would rather use spectral techniques than neural networks. I have been evaluating the cryptographic security of block ciphers with small message size and very small key size (for cryptocurrency research), and it seems like the spectral techniques that I have developed give consistent measures of security for such block ciphers (I am not done with the training yet) and these spectral techniques are better for cryptanalysis than neural networks. I have been able to use these spectral techniques for other problems such as the problem of finding the largest clique in a graph (this is not something that I would have expected before I did it), and right now these spectral techniques are the only way I know how to transform a non-commutative polynomial into something that other machine learning models can work with better.

Right now, I do not know how to use spectral techniques to replace deep neural networks. I do not know how to use spectral techniques to approximate a universal function and I do not know how to use spectral techniques to make machine learning models with many layers. I hope to be able to solve these problems of spectral techniques, but I agree that there will be a tradeoff between performance and interpretability. The goal is to make this tradeoff favor interpretable, aligned, and safe machine learning models.

Note that the previous post on this topic from Lawrence and Marius includes public comments about interp externalities from several people, including me.

The rough TLDR of my opinion is that interp seems reasonably far from being dangerous to capabilities, and there's a ton of work to be done. I think the value of building a thriving academic field around it seems super high, significant barriers to publishing and sharing work are toxic to this, and IMO this consideration dominates over some capabilities externalities. I also think that getting capabilities people to pivot from whatever they're doing now to mech interp seems like a win: it's not clear this should be more capabilities-y than the counterfactual, but it's clearly better for alignment in expectation. I also think junior people get way too stressed about this and should chill out, since your first piece of work is highly unlikely to have any effects, positive or negative.