Against Almost Every Theory of Impact of Interpretability

Charbel-Raphaël

Best of LessWrong 2023

17th Aug 2023

AI Alignment Forum

Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let’s go!

Many thanks to @scasper, @Sid Black , @Neel Nanda , @Fabien Roger , @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.

When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability, from Neel Nanda, but I later expanded the scope of my critique. Some ideas which are presented are not supported by anyone, but to explain the difficulties, I still need to 1. explain them and 2. criticize them. It gives an adversarial vibe to this post. I'm sorry about that, and I think that doing research into interpretability, even if it's no longer what I consider a priority, is still commendable.

How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?" which can be mostly skipped at first. I expect this document to also be useful for people not doing interpretability research. The different sections are mostly independent, and I’ve added a lot of bookmarks to help modularize this post.

If you have very little time, just read (this is also the part where I’m most confident):

Auditing deception with Interp is out of reach (4 min)
Enumerative safety critique (2 min)
Technical Agendas with better Theories of Impact (1 min)

Here is the list of claims that I will defend:

(bolded sections are the most important ones)

Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models.

The emperor has no clothes?

I gave a talk about the different risk models, followed by an interpretability presentation, then I got a problematic question, "I don't understand, what's the point of doing this?" Hum.

Feature viz? (left image) Um, it's pretty but is this useful?^[1] Is this reliable?
GradCam (a pixel attribution technique, like on the above right figure), it's pretty. But I’ve never seen anybody use it in industry.^[2] Pixel attribution seems useful, but accuracy remains the king.^[3]
Induction heads? Ok, we are maybe on track to reverse engineer the mechanism of regex in LLMs. Cool.

The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!

The overall Theory of Impact is quite poor

Neel Nanda has written A Long List of Theories of Impact for Interpretability, which lists 20 diverse Theories of Impact. However, I find myself disagreeing with the majority of these theories. The three big meta-level disagreements are:

Whenever you want to do something with interpretability, it is probably better to do it without it. I suspect Redwood Research has stopped doing interpretability for this reason (see the current plan here EAG 2023 Bay Area The current alignment plan, and how we might improve it).
- This is particularly true for counteracting deceptive alignment, even though it is the main focus of interpretability research. [section deception]
Interpretability often attempts to address too many objectives simultaneously. Please Purchase Fuzzies and Utilons Separately: i.e. it is very difficult to optimize multiple objectives at the same time! It is better to optimize directly for each sub-objective separately rather than mixing everything up. When I look at this list by Neel Nanda, I see that this principle is not followed.
Interpretability could be harmful. Successfully using interp for safety could certainly prove useful for capabilities. [section Harm]

Other less important disagreements:

Conceptual advances are more pressing, and interp likely won't assist in advancing these discussions. [section end story]
Current interpretability is primarily used for post-hoc analysis and has shown little utility ex-ante or for predictive capacity [section predictor of future systems]

Here are some key theories with which I disagree:

Theory of Impact 2: “Better prediction of future systems”
Theory of Impact 4: “Auditing for deception”

In the appendix, I critique almost all the other Theories of Impact.

Interp is not a good predictor of future systems

Theory of Impact 2: “Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This allows us to better extrapolate from current systems to future systems, in a similar sense to scaling laws. E.g, observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training” from Neel Nanda.

Nitpicking on the induction head example. If we focus on the above example, "models may rapidly gain capabilities during training," I don't have the impression that it was interpretability that enabled us to find this out, but rather behavioral evaluations. Loss was measured regularly during training, and the rapid gain of induction capability was measured by having a model copy a random series of tokens. In the beginning, copying does not work, but after some training, it works. Interpretability only tells us that this coincides with the appearance of induction heads, but I don't see how interpretability allows us "to better extrapolate from current systems to future systems." Also, induction heads are studied in the first place because they were easy to study.
Interpretability is mostly done ex post, after a phenomenon has been discovered, not ex ante.^[4]
- We first observed the grokking phenomenon, and only then did we proceed to do some interpretability on it. Are there any counterexamples?
- (In What DALL-E 2 can and cannot do, we see that DALL-E 2 is not able to spell words correctly. Then 2 months later, Imagen could spell the words correctly. We didn’t even bother with interp.)
There are better ways to predict the future capabilities of those systems. Thinking out of the box, if you really want to see what future systems will look like, it's much easier to look at the papers published in the NeurIPS conferences and cognitive architecture like AutoGPT. Otherwise, subscribing to DeepMind's RSS feed is not a bad idea.
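
The behavioral measurement mentioned in the induction head point above (having a model copy a random series of tokens) can be sketched without any access to internals. This is a minimal illustration, not the evaluation used in the induction heads work: `copy_accuracy`, `constant_predictor` and `induction_predictor` are hypothetical names, and the two predictors are hand-written stand-ins for a model before and after it gains the copying ability.

```python
import random

def make_repeated_sequence(length, vocab_size, rng):
    """Distinct random tokens followed by an exact repeat of the same tokens."""
    first_half = rng.sample(range(vocab_size), length)
    return first_half + first_half

def copy_accuracy(predict_next, length=20, vocab_size=1000, trials=50, seed=0):
    """Fraction of repeated-half tokens predicted correctly from the prefix alone."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(trials):
        seq = make_repeated_sequence(length, vocab_size, rng)
        # Score only positions inside the repeat, where copying is possible.
        for pos in range(length + 1, 2 * length):
            correct += predict_next(seq[:pos]) == seq[pos]
            total += 1
    return correct / total

def constant_predictor(prefix):
    """Stand-in for an early checkpoint: ignores the context entirely."""
    return 0

def induction_predictor(prefix):
    """Stand-in for a later checkpoint: find the previous occurrence of the
    last token and predict whatever followed it."""
    last = prefix[-1]
    for i in range(len(prefix) - 2, -1, -1):
        if prefix[i] == last:
            return prefix[i + 1]
    return 0

early = copy_accuracy(constant_predictor)
late = copy_accuracy(induction_predictor)
```

Tracking a number like `copy_accuracy` across checkpoints is enough to see the capability appear; the function only ever looks at inputs and outputs.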

Auditing deception with interp is out of reach

Auditing deception is generally the main motivation for doing interp. So here we are:

Theory of Impact 4: Auditing for deception: Similar to auditing, we may be able to detect deception in a model. This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features - I see this more as a theory of change for 'worlds where interpretability is harder than I hope' from Neel Nanda.

I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception. For example, let's say I reverse engineer a random circuit in GPT-2, such as in the paper Interpretability in the wild, where they reverse engineer the indirect object identification circuit. It’s not clear at all how this will help with deception.
- Even if the intended meaning was "identify circuits/features that may be relevant to deception/social modeling", it's not clear whether analyzing every circuit would be sufficient (see the "Enumerative Safety" subsection).
We are nowhere near the level required to detect or train away deception with interp. In his article A transparency and interpretability tech tree, Evan Hubinger lists 8 levels of interpretability, with only levels 7 and 8 providing some means to combat deception. These levels roughly describe the desiderata of interpretability, but we have only reached level 2 so far, and we have already encountered negative results at level 4. Evan explains that "any level of transparency and interpretability tech that is robust to deceptive models is extremely difficult".
“Furthermore, trying to uncover deception in advance via interpretability tools could fail simply because there is no sense that a deceptively aligned model has to actively be thinking about its deception. A model that has never seen a situation where there is an opportunity to seize power need not be carefully planning out what it would do in such a situation any more than a factory cleaning robot need be planning for what to do if it someday found itself in a jungle instead of a factory. Nevertheless, the fact that the model has not been previously planning to seize power doesn’t imply that it wouldn’t if given the opportunity. In particular, a model could be deceptively aligned simply because it reasons that, in situations where there is a clear overseer, doing what it wants is a good general strategy for gaining power and influence in the world—without needing any explicit plans for later deception.” (from Hubinger in Monitoring for deceptive alignment)
There are already negative conceptual points against interpretability, which show that advanced AIs will not be easily interpretable, as discussed in the section on interpretability in the list of lethalities (these are points I’ve tried in the past to critique and have mostly failed). Especially points 27, 29, and 33:
- 27. Selecting for undetectability: “Optimizing against an interpreted thought optimizes against interpretability.”
- 29. Real world is an opaque domain: “The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good.”
  - And cognition can be externalized. This is not specific to interp. Many patterns can only be explained for how they interact with the environment, and can't be fully explained by what's in the network alone. E.g. “Consult a recipe book and take actions written in this book.” (example from Connor).
- 33. Alien Concepts: “The AI does not think like you do” There may not necessarily be a humanly understandable explanation for cognition done by crunching numbers through matrix products.
- I don't fully agree with all of these points, but I haven't seen much discussion on these specific points. You can find some caveats in my critique.
Other weaker difficulties in footnote.^[5]

Counteracting deception with only interp is not the only approach:

Interp is not the only way to study Deception. Here are other neglected paradigms:
- Adversarial attack (basically Redwood’s Plan). Lots of important adversarial strategies don’t rely on interp like Consistency checks, AI checks and balances and regular Prompting^[6].
- Create a toy model of deceptive alignment or a simulation: Studying “in vitro demonstrations of the kinds of failures that might pose existential threats” from Model Organisms of Misalignment, a pretty recent and detailed post on this.
- Finding deceptive alignment proxies: For example monitoring the following 4 criteria which are generally seen as prerequisites of deceptive alignment: Goal-directed behavior, Optimizing across episodes/long-term goal horizons, Conceptualization of the base goal, Situational awareness.
- Miscellaneous baseline strategies for near human-level AI:
  - In NLP, monitoring the plan and chain-of-thought of AutoGPT (details and caveats).
  - In vision, EfficientZero-like consistency loss - to see into the future in the Monte Carlo tree (draft)
- More ideas like Neural distillation and Speed Priors.
- If DeepMind were to announce today that they had discovered deception in a GPT, it's unlikely that they would have used only interpretability to make that discovery. It's far more likely they would have used regular prompting.
There are preventive measures against Deceptive Alignment which seem much more workable (See section Preventive measures against Deception).
Conceptual advances are more urgent. It's much more fruitful to think about deception conceptually than through interpretability. And to the best of my knowledge, interpretability hasn't taught us anything about deception yet.
- For example, the Simulator Theory and the understanding that GPTs can already emulate deceptive simulacra is a bigger advance in our understanding of deceptive alignment than what has happened in interpretability for deception.
- Conceptual considerations on deceptive alignment, as in the article Deceptive Alignment is <1% Likely by Default or How likely is deceptive alignment? don’t rely at all on interpretability.

Inspired by every discussion I’ve had with friends defending interp. “Your argument for astronomy is too general”, so let's deep dive into some object-level arguments in the following section!

What does the end story of interpretability look like? That’s not clear at all.

This section is more technical. Feel free to skip it and go straight to "So far my best ToI for interp: Outreach" , or just read the "Enumerative safety" section, which is very important.

Of course, it seems that interpretability in deep learning is inherently more feasible than neuroscience because we can save all activations, run the model very slowly, and try causal modifications to understand what is happening, which allows much more control than an fMRI. But it seems to me that this is still not enough - we don't really know what we are aiming for and rely too much on serendipity. Are we aiming for:

Enumerative safety?

Enumerative safety, as Neel Nanda puts it, is the idea that we might be able to enumerate all features in a model and inspect them for features related to dangerous capabilities or intentions. I think this strategy is doomed from the start (from most important to least important):

Determining the dangerousness of a feature is a mis-specified problem. Searching for dangerous features in the weights/structures of the network is pointless. A feature is not inherently good or bad. The danger of individual atoms is not a strong predictor of the danger of assembly of atoms and molecules. For instance, if you visualize the feature of layer 53, channel 127, and it appears to resemble a gun, does it mean that your system is dangerous? Or is your system simply capable of identifying a dangerous gun? The fact that cognition can be externalized also contributes to this point.
A feature is still a fuzzy concept: superposition remains an open problem and the natural abstraction hypothesis remains a hypothesis three years after those Distill papers, with very few convincing strategies to solve them. And that's not very surprising: the central conceptual notion of interpretability, the “feature”, seems to be intrinsically fuzzy and is still not defined. This is a major problem for the "enumerative safety" strategy and for iterating on neurons one by one to verify the "goodness" of each feature and obtain guarantees:
- And because of superposition, iterating over each neuron is not valid. So we can't just iterate on neurons, but we have to iterate on all sets of neurons (or worse, all directions), which is totally computationally intractable.
Properties of models which are dangerous are not low-level features, but high-level behavioral abilities like being able to code, sycophancy or various theories of mind proxies, situational awareness, or hacking.
- A network's situational awareness will likely include several sub-features such as date and time, geographical position, and the current needs of its users. Removing these sub-features would make the model less competitive.
Deep Deceptiveness - In simple terms, a system can be deceptive even if no single part is dangerous because of optimization pressure, and complex interactions between the model and the environment.
This strategy has already been tried for vision via automatic interpretability techniques to label all neurons, and it doesn't seem to have advanced alignment much, and most neurons evade simple interpretations:
- NetDisect & Compositional explanations of neurons (Mu and Andreas, 2021)
- Natural Language Descriptions of Deep Visual Features (Andreas, 2022)
- Clip-Dissect (Oikarinen, 2022)
- Towards a Visual Concept Vocabulary for GAN Latent Space (Schwettmann, 2021)
- These works [partially summarized here] have not changed the way we try to make vision systems more robust and less risky in practice.
Most automatic interpretability works, like Language models can explain neurons in language models from OpenAI or concept erasure techniques, fall into this category.
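
To make the intractability claim in the superposition point above concrete, here is a small back-of-the-envelope calculation. The layer width of 4096 is an illustrative assumption, not a figure taken from any specific model, and `count_neuron_sets` is a hypothetical helper.

```python
import math

def count_neuron_sets(n_neurons, max_set_size):
    """Number of non-empty neuron subsets of size at most max_set_size."""
    return sum(math.comb(n_neurons, k) for k in range(1, max_set_size + 1))

n = 4096  # assumed width of a single layer, for illustration only

singles = count_neuron_sets(n, 1)            # inspecting neurons one by one
pairs_and_singles = count_neuron_sets(n, 2)  # also allowing pairs
up_to_five = count_neuron_sets(n, 5)         # sets of at most five neurons
all_subsets = 2 ** n - 1                     # every non-empty set in the layer
```

Under this assumption, restricting attention to sets of at most five neurons in one layer already gives more than 10^15 candidates, and the count of all subsets has over a thousand decimal digits. Directions in activation space are worse still, since they form a continuum rather than a finite list.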

Reverse engineering?

Reverse engineering is a classic example of interpretability, but I don't see a successful way forward. Would this be:

The equivalent C++ annotated algorithm of what the model does? Being able to reproduce the capabilities of the inscrutable matrices of GPT-4 with some modular C++ code would already be past human-level intelligence, and this would be too dangerous, because it would allow a lot of different optimizations and probably allow recursive self-improvement, which seems dangerous, especially if we rely on an automated process for that.
An explanation in layman terms of the behavior of the model? At which level of granularity? Each token or sentence or paragraph? This is really unclear.
The functional connectome of the model obtained with high level interp? Ok, you see in the functional connectome that the model is able to code and to hack, and those are dangerous capabilities. Isn’t this just regular evals?
- In practice, to conduct interp experiments, we almost always start by creating a dataset of prompts. Maybe one day we won't need prompts to activate these capabilities, but I don't see that happening anytime soon.
A graph to explain the circuits? Graphs like the ones just below can be overwhelming and remain very limited.

You can notice that “Enumerative safety” is often hidden behind the “reverse engineering” end story.

From the IOI paper. Understanding this diagram from 'Interpretability in the Wild' by Wang et al. 2022 is not essential for our discussion. Understanding the full circuit and the method used would require a three-hour video. And, this analysis only focuses on a single token and involves numerous simplifications. For instance, while we attempt to explain why the token 'Mary' is preferred over 'John', we do not delve into why the model initially considers either 'Mary' or 'John'. Additionally, this analysis is based solely on GPT2-small.

Indeed, this figure is quite terrifying. It is from Causal scrubbing: results on induction heads, for a 2-layer model. After refining the hypothesis 4 times, they are able to restore 86% of the loss. But even for this simple task they say “we won’t end up reaching hypotheses that are fully specific or fully human-understandable, causal scrubbing will allow us to validate claims about which components and computations of the model are important.”

The fact that reverse engineering is already so difficult in the two toy examples above seems concerning to me.

Olah’s interpretability dream?

Or maybe interp is just an exploration driven by curiosity waiting for serendipity?

Interpretability Dreams is an informal note by Chris Olah on future goals for mechanistic interpretability. It discusses superposition, the enemy of interpretability. Then, towards the end of the note, in the section titled “How Does Mechanistic Interpretability Fit Into Safety?”, we understand that the plan is to solve superposition to be able to use the following formula:
But this is simply again “enumerative safety” stated in terms of circuits rather than features. However, as explained above, I don't think this leads us anywhere.
The final section of the note, Beauty and Curiosity, reads like a poem or hymn to beauty. However, it seems to lack substance beyond a hope for serendipitous discovery.

Overall, I am skeptical about Anthropic's use of the dictionary learning approach to solve the superposition problem. While their negative results are interesting, and they are working on addressing conceptual difficulties around the concept of "feature" (as noted in their May update), I remain unconvinced about the effectiveness of this approach, even after reading their recent July updates, which still do not address my objections about enumerative safety.

One potential solution Olah suggests is automated research: "it does seem quite possible that the types of approaches […] will ultimately be insufficient, and interpretability may need to rely on AI automation". However, I believe that this kind of automation is potentially harmful [section Harmful].

This is still a developing story, and the papers published on Distill are always a great pleasure to read. However, I remain hesitant to bet on this approach.

Retargeting the search?

Or maybe interp could be useful for retargeting the search? This idea suggests that if we find a goal in a system, we can simply change the system's goal and redirect it towards a better goal.

I think this is a promising quest, even if there are still difficulties:

This is interesting because this would be a way to not need to fully reverse engineer a complete model. The technique used in Understanding and controlling a maze-solving policy network seems promising to me. Just focusing on “the motivational API” could be sufficient.
But I still don’t know if steering vectors (i.e. activation additions of a vector in the latent space) really count as interpretability, or whether they significantly change the picture of alignment beyond just prompt engineering. Ok, this is a new way to tinker with the model. But I don’t know how this could be used reliably against deception.^[7]
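
For readers unfamiliar with activation additions, the operation itself is small. The sketch below uses plain Python lists as stand-in activations; the numbers, the `coefficient` parameter and the function names are illustrative assumptions and do not come from the maze-solving work or from any particular library.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(activations_with, activations_without):
    """Difference of mean activations between two contrasting prompt sets."""
    mu_with = mean_vector(activations_with)
    mu_without = mean_vector(activations_without)
    return [a - b for a, b in zip(mu_with, mu_without)]

def add_steering(hidden_state, vector, coefficient=1.0):
    """Activation addition: shift a hidden state along the steering vector."""
    return [h + coefficient * v for h, v in zip(hidden_state, vector)]

# Toy 3-dimensional activations recorded at one layer (made-up numbers).
with_concept = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_concept = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

v = steering_vector(with_concept, without_concept)
steered = add_steering([0.5, 1.0, 1.0], v, coefficient=2.0)
```

In this toy case `v` is `[2.0, 0.0, 0.0]` and the steered state is `[4.5, 1.0, 1.0]`. Nothing in the operation requires knowing what the direction means, which is one way to see why its status as interpretability is debatable.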

Relaxed adversarial training?

Relaxed adversarial training? The TL;DR is that relaxed adversarial training is the same as adversarial training, but instead of creating adversarial inputs to test the network, we create adversarial latent vectors. This could be useful because creating realistic adversarial inputs is a bottleneck in adversarial training. [More explanations here]
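
The difference between an input-space and a latent-space adversary can be illustrated on a toy model. In this sketch the "network" is a fixed linear readout on a latent vector, `unsafe_score` is a hypothetical scalar we would like to keep low, and the adversary perturbs the latent directly within an L2 ball of radius `epsilon`. For a linear readout the worst-case perturbation has a closed form; an actual proposal would presumably search for it with gradient steps through a nonlinear model.

```python
import math

def unsafe_score(latent, readout):
    """Toy 'bad behaviour' score: a linear readout of the latent vector."""
    return sum(l * w for l, w in zip(latent, readout))

def worst_case_latent(latent, readout, epsilon):
    """Perturb the latent by L2 norm epsilon in the direction that
    maximises the linear score (the direction of the readout weights)."""
    norm = math.sqrt(sum(w * w for w in readout))
    return [l + epsilon * w / norm for l, w in zip(latent, readout)]

readout = [3.0, 4.0]  # assumed readout weights (L2 norm 5)
latent = [1.0, -1.0]  # latent produced by some ordinary input
epsilon = 0.5         # radius of the allowed latent perturbation

clean = unsafe_score(latent, readout)
attacked = unsafe_score(worst_case_latent(latent, readout, epsilon), readout)
```

The score rises by `epsilon` times the norm of the readout (here by 2.5), and the adversary never has to construct a realistic input that would produce the perturbed latent.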

This seems valid but very hard, and there are still significant conceptual difficulties. A concrete approach, Latent Adversarial Training, has been proposed, and seems to be promising but:

The procedure is underspecified. There will be too many meta-parameters. Calibrating these meta-parameters will require some iteration, and you probably don’t want to iterate on deceptive powerful models. We have to be good right away from the first choice of meta-parameters. As the author himself says, "the only hope here lies in the Surgeon forcing the model to be robustly safe before it learns to deceive. Once the model is deceptive it's really game-over."
We still have no guarantees. This procedure allows for a latent space that is robust to “small perturbations”, but being robust to “small perturbations” is not the same as not becoming deceptive (it’s not clear to me that deception won’t appear outside the constraint zone).
Papers using this kind of procedure have only limited effectiveness, for example around 90% detection rate in the paper ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation (Liu et al., 2019). [Paper summarized here] And I don’t think this could work against all types of trojans.

The exact procedure described in Latent Adversarial Training hasn't been tested, as far as I know. So we should probably work on it.^[8]

Microscope AI?

Maybe Microscope AI? That is, maybe we could directly use the AI’s world model without having to understand everything. A Microscope AI is an AI that would not be used for inference or deployed, but only examined by looking at its internal activations or weights. My definition would be something like: We can run forward passes, but only halfway through the model.

This goes against almost every economic incentive (see Why Tool AIs Want to Be Agent AIs, from Gwern).
($) Interpretability has been mostly useless for discovering facts about the world, and learning new stuff by only looking at the weights is too hard.
- In the paper Acquisition of Chess Knowledge in AlphaZero, the authors investigate whether “we can learn chess strategies by interpreting the trained AlphaZero's behavior”. Answer: This is not the case. They probe the network using only concepts already known to Stockfish, and no new fundamental insights are gained. We only check when AlphaZero learns human concepts during the training run.
- I don’t think we will be able to learn category theory by reverse engineering the brain of Terence Tao. How do Go players learn strategies from go programs? Do they interpret AlphaGo’s weights, or do they try to understand the behavioral evaluations of those programs? Answer: They learn from their behavior, but not by interpreting models. I am skeptical that we can gain radically new knowledge from the weights/activations/circuits of a neural network that we did not already know, especially considering how difficult it can be to learn things from English textbooks alone.

Microscope AIs should not be agentic by definition. But agency and exploration help tremendously at the human level for discovering new truths. Therefore, below superhuman level, the microscope needs to be agentic…and this is a contradiction. Using Microscope AI as a tool rather than an agent is suggested here or here for example. However, to know the truth of a complex fact, we need to experiment with the world and actively search for information. Here is a fuzzy reasoning (feel free to skip):
- A) Either the information already exists and is written plainly somewhere on the internet, and in that case, there is no need for Microscope AI (this is like text retrieval).
- B) Or the information doesn't exist anywhere on the internet, and in that case, it is necessary to be agentic by experimenting with the world or by thinking actively. This is the type of feature that can only be “created” by reinforcement learning but which cannot be “discovered” with supervised learning, like MuZero discovering new chess strategies.
- or C), this info is not plainly written but is a deep feature of the training data that could be understood/grokked through gradient descent. This is the type of feature that can be “discovered” with supervised learning.
- If B), we need agency, and it’s no longer a microscope.
- If C), we can apply the above reasoning ($) + Being able to achieve this through pure gradient descent without exploration is probably a higher level of capability than being able to do it with exploration. (This would be like discovering the Quaternion formula during a dream?). But even legendary mathematicians need to work a bit and be agentic in their exploration; they don't just passively read textbooks. Therefore, it's probably beyond Ramanujan's level and too dangerous?
- So, I'm quite uncertain, but overall I don't think Microscope AI is a promising or valid approach to reducing AI risk.

A short case study of Discovering Latent Knowledge technique to extract knowledge from models by probing is included in the appendix.

So far my best ToI for interp: Outreach?

1. Interp for Nerd Sniping/honeypot?

Interp is a highly engaging introduction to AI research. That's really cool for that, I use it for my classes, and for technical outreach, but I already have enough material on interpretability, for 10 hours of class, no need to add more.
Interp as a honeypot for junior researchers? Just as a honeypot attracts bees with its sweet nectar, interp is very successful for recruiting new technical people! But then they would probably be better off doing something other than interp (unless it is their strong comparative advantage).
(Nerd Sniping senior capability researchers into interpretability research? Less capability research, more time to align AIs? I’m joking, don’t do that at home! )

2. Honorable mentions:

Showing strange failures, such as the issue with the SolidGoldMagikarp token, highlights the possibility of unexpected results with the model. More generally, interpretability tools can be useful for the red teaming toolbox. They seem like they might be able to guide us to more problems than test sets and adversaries can alone.
Showing GPT is not a stochastic parrot? The article Actually, Othello-GPT Has A Linear Emergent World Representation is really cool. Showing that OthelloGPT contains a world model is really useful for technical outreach (even if OthelloGPT being good at Othello should be enough, no?).
It's a good way to introduce the importance and tractability of alignment research “Interpretability gives people a non-technical story for how alignment affects their lives, the scale of the problem, and how progress can be made. IMO no other approach to alignment is anywhere near as good for this.” [from Raymond D]
Better: Showing that “We have basically no idea how it does what it does.”, see this tweet:

Preventive measures against Deception seem much more workable

TL;DR: It would be more desirable to aim for a world where we wouldn't need to delve into the internal components of models. Prevention is better than cure, or at least, it is a neglected strategy.

From The Translucent Thoughts Hypotheses and Their Implications, Fabien Roger

I don't believe interpretability is the most promising method for monitoring near human-level AGI. Here’s why:

I don't think neural networks will be able to take over in a single forward pass. Models will probably reason in English and will have translucent thoughts (and we could even hope to get transparent thoughts, translucence is the scary place where steganography is possible). In order to devise an effective plan to destroy the world, the first human-level AGI will need to somewhat externalize its reasoning through chain-of-thought (CoT), and it seems to me way easier to monitor this chain-of-thought than to probe into the models' internals. We can probably rely on the Translucent Thoughts Hypotheses (Fabien, the author, gives a 20% chance, but I think it will be more like 60%). I also think that we have to do everything we can to pass regulations and place ourselves in a world where those hypotheses are true.

For example, magic is possible, and the fact that human brains can be manipulated so easily by magic is a great danger. So we should probably try to make this magic unlikely, for example with process-based training by training AI Systems to reason only step-by-step (the process-based training approach is summarized in this AGISF distillation).

Steering the world towards transparency

Thinking ahead about the training setup is very important and possibly dominates interp considerations because if the concepts manipulated by the models are totally alien, it will be much, much harder to provide oversight. And it is much easier to align those chatbots pretrained on human generated text than to align AGIs trained from scratch with RL / evolutionary methods, etc.

If this is the case, we should focus on various aspects and important recommendations as detailed by Fabien here. Here are some highlights and comments:

Making LLMs use only understandable English: Preventing models from using steganography, not training on synthetic data, etc… These directions are generally very neglected - for example, look at this comment.
Building an oversight system: That’s the “ready to go” technical research of Buck, IDA, or neglected baselines like filtering, etc…
Testing the translucent thoughts hypotheses and being able to monitor the alienness of models is super important in ensuring some simulacra have brain-like representations (see this and this). More work is needed in the direction of Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.
Making the hypotheses true, with technical works, outreach and governance, etc.

I think the strategy should be: let’s target a world where deception is unlikely. (I'm not saying we should make plans that work conditional on deception being unlikely by default, but we should try to steer AGI/the world towards a place where deception is unlikely). I believe there are multiple ways to think about and address this problem, and much more technical research is needed here, starting from Conditioning Predictive Models: Risks and Strategies.

Cognitive Emulations - Explainability By Design

If interpretability was really a bottleneck, we could use cognitive emulation, which, in my opinion, allows way better explainability and transparency than interpretability will ever get us.

My understanding of cognitive emulation: Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token. If the neural network had deceptive thoughts, we could see them in these intermediate messages.

Some caveats are in the section Cognitive Emulation of the appendix.
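
To make the idea concrete, here is a toy sketch in Python of what "submodules that send messages written in plain English to each other" could look like. Everything in it is hypothetical: `planner` and `solver` stand in for calls to smaller LLMs, and the keyword-based `flag_message` is only a placeholder for a real oversight system. The one thing it illustrates is that every intermediate message passes through a readable log.

```python
from typing import Callable, List, Tuple

# A submodule maps an English message to an English message.
# In a real system these would be calls to smaller LLMs (hypothetical stubs here).
Submodule = Callable[[str], str]


def planner(task: str) -> str:
    return f"Plan: break '{task}' into steps and answer honestly."


def solver(plan: str) -> str:
    return f"Draft answer based on [{plan}]"


def flag_message(message: str,
                 banned: Tuple[str, ...] = ("deceive", "hide this")) -> bool:
    """Placeholder monitor: flags messages containing suspicious phrases."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in banned)


def run_pipeline(task: str,
                 modules: List[Submodule]) -> Tuple[str, List[str], List[str]]:
    """Run the modules in sequence and log every intermediate message."""
    log: List[str] = []
    flagged: List[str] = []
    message = task
    for module in modules:
        message = module(message)
        log.append(message)          # every message is readable by an overseer
        if flag_message(message):
            flagged.append(message)
    return message, log, flagged
```

A keyword monitor is obviously far too weak for real oversight; the point is only that the overseer reads text rather than activations.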

Interpretability May Be Overall Harmful

(Note that some of the following points are not specific to interp, but I think they apply particularly well to interp.)

False sense of control:

False sense of understanding. It's too easy to think that you are beginning to understand, and that we're starting to get guarantees, when we don't have much. This is very classic:
- Me from the past: "Yo, I spent 5 hours trying to understand the mechanisms of induction heads and K-composition in A Mathematical Framework for Transformer Circuits, I have so much more understanding." Yes, but no.
Overinterpretation. It is very difficult to say which interpretation result is solid. For example, Sanity Checks for Saliency Maps shows that most of the pixel attribution techniques are generally misleading.^[9] In the same vein, feature viz has recently been found to have some pretty fatal flaws, see Don't trust your eyes: on the (un)reliability of feature visualizations, and model editing techniques such as ROME are very misleading. This is mostly due to methodological problems that Stephen Casper explains in his sequence. [see appendix: methodological problems].
Safety Washing. I feel that there is a part of safety research which is here to legitimize capability research in the big labs (although this is not entirely specific to interp).
- “I think a really substantial fraction of people who are doing "AI Alignment research" are instead acting with the primary aim of "make AI Alignment seem legit". These are not the same goal, a lot of good people can tell and this makes them feel kind of deceived, and also this creates very messy dynamics within the field where people have strong opinions about what the secondary effects of research are, because that's the primary thing they are interested in, instead of asking whether the research points towards useful true things for actually aligning the AI”, from Shutting Down the Lightcone Offices.
The achievements of interp research are consistently graded on their own curve and overhyped compared to achievements in other fields like adversaries research. For example, the recent paper Universal and Transferable Adversarial Attacks on Aligned Language Models impressively found effective attacks against state-of-the-art models without any interpretation of model internals. Imagine if mechanistic interpretability researchers had done the exact same thing, but by studying model internals. Given the excitement that has emerged in the past around the achievements of mechanistic interpretability in toy models on cherry-picked problems (e.g. this or this), it seems that something like this would probably have made the AI safety research community go wild. Stephen Casper makes a similar point here: “From an engineer’s perspective, it’s important not to grade different classes of solutions each on different curves.” Other examples of this are presented in EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety (thanks to Stephen for highlighting this point).

The world is not coordinated enough for public interpretability research:

Dual use. It seems anything related to information representation can be used in a dual manner. This is a problem because I believe that the core of interpretability research could lead to major advances in capabilities. See this post.
- Using the insights provided by advanced interp to improve capabilities, such as modularity to optimize inference time and reduce flops, is likely to be easier than using them for better oversight. This is because optimizing for capability is much simpler than optimizing for safety, as we lack clear metrics for measuring safety (see the figure below).
When interpretability starts to be useful, you can't even publish it because it's too info hazardous. The world is not coordinated enough for public interpretability research.
- Nate Soares explained this, and this was followed by multiple posts. “Insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed. […] I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life.”
- Good interp could produce a "foom overhang" as described in "AGI-Automated Interpretability is Suicide".
- Good interp also creates an infosec/infohazard attack vector.
- The post 'Why and When is Interpretability Work Dangerous?' ends on a sobering note, stating, “In closing, if alignment-conscious researchers continue going into the interpretability subfield, the probability of AGI ruin will tend to increase.”
Interpretability already helps capabilities. For example, the understanding of induction heads has allowed for better architectures^[10].
Interpretability may be a super wicked problem^[11].

Thus a list of "theories of impact" for interpretability should not simply be a list of benefits. It's important to explain why these benefits outweigh the possible negative impacts, as well as how this theory can save time and mitigate any new risks that may arise.

The concrete application of the logit lens is not an oversight system for deception, but rather capability work to accelerate inference, as in this paper. (Note that the paper does not cite logit lens, but relies on a very similar method).

Outside view: The proportion of junior researchers doing interp rather than other technical work is too high

It seems to me that many people start alignment research as follows:

At the end of Arena, an advanced upskilling program in AI Safety, almost all research projects this year (June 2023), except for two out of 16, were interp projects.
At EffiSciences, at the end of the last 3 ML4Good bootcamps, students all start by being interested in interp, and it is a very powerful attractor. I myself am guilty. I have redirected too many people to it. I am now trying to correct my ways.
- In the past, if I reconstruct my motivational story, it goes something like this: "Yo, I have a math/ML background, how can I recycle that?" --> then brrr interp, without asking too many questions.
During Apart Research hackathons, interpretability hackathons tend to draw 3.12 times as many participants as other types of hackathons (thinkathons, safety benchmarks, …).^[12]
Interpretability streams in SERI MATS are among the most competitive streams (see this tweet). People then try hard, get rejected, get disappointed and lose motivation. This is a recent and important problem.

"Not putting all your eggs in one basket" seems more robust considering our uncertainty, and there are more promising ways to reduce x-risk per unit of effort (to come in a future post, mostly through helping/doing governance). I would rather see a more diverse ecosystem of people trying to reduce risks. More on this in section Technical Agendas with better ToI.

If you ask me whether interp is also overrepresented among senior researchers, I'm a bit less confident. Interp also seems to be a significant portion of the pie: this year, while Conjecture and Redwood have partially pivoted, there are new active interp teams in Apollo, DeepMind, OpenAI, and still in Anthropic. I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.

Even if we completely solve interp, we are still in danger

No one has ever claimed otherwise, but it's worth remembering to get the big picture. From stronger arguments to weaker ones:

There are many X-risk scenarios that do not even involve deceptive AIs. Here is a list of such scenarios (see this cheat sheet):
- Christiano1 - You get what you measure
- Critch1 - Production Web
- Soares - A central AI alignment problem: capabilities generalization, the sharp left turn
- Cohen et al. - Advanced artificial agents intervene in the provision of reward
- Gwern - It Looks Like You’re Trying To Take Over The World
- Exercise: Here is a list of risks from the Center for AI Safety. Which ones can be solved by interp? At least half of those risks don’t directly involve deception and interp.
Total explainability of complex systems with great power is not sufficient to eliminate risks. Significant risks would still remain. Despite our full understanding of how atomic bombs function, they still pose substantial risks. See this list of nuclear close calls.
Interpretability implicitly assumes that the AI model does not optimize in a way that is adversarial to the user. Consider being able to read the mind of a psychopath like Voldemort. Would this make you feel safe? The initial step remains to box him. However, a preferable scenario would be not having to confront this situation at all. (this last claim is probably the most important lesson - see Preventive measures).

PyTorch hooks can be used to study the internals of models. Are they going to be sufficient? *Idk, but* *Hook Me up Baby, from the album “Take Me as I Am” could be the national anthem of interp.*

That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).

A version of this argument applies to "alignment" in general and not just interp and those considerations will heavily influence my recommendations for technical agendas.

Technical Agendas with better ToI

Interp is not such a bad egg, but opportunity costs can be huge (especially for researchers working in big labs).

I’m not saying we should stop doing technical work. Here's a list of technical projects that I consider promising (though I won't argue much for these alternatives here):

Technical works used for AI Governance. A huge amount of technical and research work needs to be done in order to make regulation robust and actually useful. Mauricio’s AI Governance Needs Technical Work and the governance section of AGI safety career advice by Richard Ngo are really great: “It’s very plausible that, starting off with no background in the field, within six months you could write a post or paper which pushes forward the frontier of our knowledge on how one of those topics is relevant to AGI governance.”
- For example, each of the measures proposed in the paper Towards Best Practices in AGI Safety and Governance: A Survey of Expert Opinion could be a pretext for creating a specialized organization to address these issues, such as auditing, licensing, and monitoring.
- Scary demos (But this shouldn't involve gain-of-function research. There are already many powerful AIs available. Most of the work involves video editing, finding good stories, distribution channels, and creating good memes. Do not make AIs more dangerous just to accomplish this.).
- In the same vein, Monitoring for deceptive alignment is probably good because “AI coordination needs clear wins”.
- Interoperability in AI policy, and good definitions usable by policymakers.
- Creating benchmarks for dangerous capabilities.
- Here’s a list of other ideas
Characterizing the technical difficulties of alignment. (Hold Off On Proposing Solutions “Do not propose solutions until the problem has been discussed as thoroughly as possible without suggesting any.”)
- Creating the IPCC of AI Risks
- More red-teaming of agendas
- Explaining problems in alignment.
Adversarial examples, adversarial training, latent adversarial training (the only end-story I’m kind of excited about). For example, the papers "Red-Teaming the Stable Diffusion Safety Filter" or "Universal and Transferable Adversarial Attacks on Aligned Language Models" are good (and pretty simple!) examples of adversarial robustness works which contribute to safety culture.
Technical outreach. AI Explained and Rob Miles have plausibly reduced risks more than all interpretability research combined.
In essence, ask yourself: “What would Dan Hendrycks do?”
- Technical newsletter, non-technical newsletters, benchmarks, policy recommendations, risks analysis, banger statements, courses and technical outreach.
- He is not doing interp. Checkmate!

In short, my agenda is "Slow Capabilities through a safety culture", which I believe is robustly beneficial, even though it may be difficult. I want to help humanity understand that we are not yet ready to align AIs. Let's wait a couple of decades, then reconsider.

And if we really have to build AGIs and align AIs, it seems to me that it is more desirable to aim for a world where we don't need to probe into the internals of models. Again, prevention is better than cure.

Conclusion

I have argued against various theories of impact of interpretability, and proposed some alternatives. I believe working back from the different risk scenarios and red-teaming the theories of impact gives us better clarity and a better chance at doing what matters. Again, I hope this document opens discussions, so feel free to respond in parts. There probably should be a non-zero number of researchers working on interpretability. This post isn’t intended as an attack, but I hope it prompts more careful analysis and comparison to other theories of impact.

We already know some broad lessons, and we already have a general idea of which worlds will be more or less dangerous. Some ML researchers in top labs aren't even aware, or don't acknowledge, that AGI is dangerous, and that connecting models to the internet, encouraging agency, doing RL and maximizing metrics isn't safe in the limit.

Until civilization catches up to these basic lessons, we should avoid playing with fire, and should try to slow down the development of AGIs as much as possible, or at least steer towards worlds where it’s done only by extremely cautious and competent actors.

Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs, and then align them. For X-risks, prevention is better than cure. Let’s not build powerful and dangerous AIs. We aspire for them to be safe, by design.

Appendix

Related works

There is a vast academic literature on the virtues and academic critiques of interpretability (see this page for plenty of references), but relatively little holistic reflection on interpretability as a strategy to reduce existential risks.

The most important articles presenting arguments for interpretability:

Against interpretability

The Engineer’s Interpretability Sequence

I originally began my investigation by rereading “The Engineer’s Interpretability Sequence”, in which Stephen Casper raises many good critiques of interpretability research, and this was really illuminating.

Interpretability tools lack widespread use by practitioners in real applications.

No interpretability technique is yet publicly known to have been used in production in SOTA models such as ChatGPT.
There have been interpretability studies of SOTA multimodal models such as CLIP in the past, but these studies are only descriptive.
The efficient market hypothesis: The technique used for the censorship filter of the Stable Diffusion model was a vulgar cosine similarity threshold between generated image embeddings and a list of taboo concepts. Yes, this may seem a bit ridiculous, but at least there is a filter, and it appears that interp has not yet been able to provide more convenient tools than this.
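
For readers who have not seen this kind of filter, here is a minimal sketch of a cosine similarity threshold over embeddings. It is not the actual Stable Diffusion code: the function names, the toy 2-dimensional embeddings, and the threshold of 0.9 are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matched_concepts(image_embedding: Sequence[float],
                     concept_embeddings: Dict[str, Sequence[float]],
                     threshold: float = 0.9) -> List[str]:
    """Return the taboo concepts whose embedding is too close to the image.

    `threshold` is an illustrative value, not the one used in any real filter.
    """
    return [
        name
        for name, concept in concept_embeddings.items()
        if cosine_similarity(image_embedding, concept) > threshold
    ]


# Usage: the image is blocked if any taboo concept matches.
concepts = {"taboo_a": [1.0, 0.1], "taboo_b": [0.0, 1.0]}
blocked = bool(matched_concepts([1.0, 0.0], concepts))
```

The whole filter is a few lines of arithmetic on embeddings, which is part of why it is hard for more sophisticated interp tools to be more convenient.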

Broad critiques. He explains that interp is generally not scaling, relying too much on humans, failing to combine techniques. He also criticizes mech interp, which may not be the best way of doing interp, because of cherry-picking, focusing only on toy examples and lack of scalability, and failing to do useful things.

Methodological problems:

He points out, "The root cause of this has much to do with interpretability research not being approached with as much engineering rigor as it ought to be."
One good point to note is that since the publication of his sequence, certain methods have become more rigorous. For example, causal methods, activation patching, and probing with only linear probes are more rigorous and promising than descriptive techniques.
But I also believe that interpretability may be fundamentally flawed [Section What does the end story of interp look like].

Cognitive Emulations - Explainability By design

If interpretability was really a bottleneck, we could use cognitive emulation, which, in my opinion, allows way better explainability/transparency than interpretability will ever get us.

Interpretability is needed only if one module of the cognitive emulation is deceptive. Then maybe you can use interpretability to explain this module. But I'm confident models will need Chain of Thought for the foreseeable future. So interpretability is not 'urgent' research.

If we don’t use Reinforcement learning on this cognitive emulation, and if we only prompt in English the different submodules, the cognitive emulation won’t create whole “new languages” or be able to obfuscate a lot of its thoughts.

Yes, there are problems with this proposal:

CoEms don't preclude steganography or obfuscation.
Even when asked to reason with Chain-of-Thought (CoT), LLMs are able to skip some inference steps to some extent (see Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting). However, if we don’t train models to skip a zillion inference steps, they won’t be able to obfuscate their thoughts efficiently.

But we could further reduce these risks with:

Process-based training, to ensure that the systems are accustomed to detailing their thoughts in much greater detail (and not being able to skip too many steps). It seems much more feasible than interpretability (and the alignment tax even seems negative - for example, see this progress on mathematical reasoning).
Selection inference: “CoT has access to the whole context, so each reasoning step is not necessarily causally connected to the last. But selection inference enforces a structure where each reasoning step necessarily follows from the last, and therefore the whole reasoning chain is causal.” from Sid Black, CTO of Conjecture.
Other ideas were listed in section “Steering the world towards transparency”.

Spicy: However, cognitive emulation will quite likely be an engineering nightmare, facing significant robustness issues that are always present in small models. The alignment tax will be higher than for end-to-end systems, making it unlikely that we will ever use this technology. The bottleneck is probably not interp, but rather an ecosystem of preventive safety measures and a safety culture. Connor Leahy, CEO of Conjecture, explaining the difficulties of the problem during interviews and pushing towards a safety culture, is plausibly more impactful than the entire CoEm technical agenda.

Detailed Counter Answers to Neel’s list

Here is Neel’s Longlist of Theories of Impact for Interpretability with critiques for each theory. Theories proposed by Neel are displayed in italics, whereas my critiques are rendered in standard font.

Force-multiplier on alignment research: We can analyse a model to see why it gives misaligned answers, and what's going wrong. This gets much richer data on empirical alignment work, and lets it progress faster.
- I think this "force multiplier in alignment research" theory is valid, but is conditioned on the success of the other theories of impact, which imho are almost all invalid.
- Conceptual advancements are more urgent. It's better to think conceptually about what misalignment means rather than focusing on interp. [Section What does the end story of interpretability look like?]
- Dual Use: Force-multiplier on capability research.
Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This allows us to better extrapolate from current systems to future systems, in a similar sense to scaling laws. Eg, observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training
- Critiqued in section “Interp is not a good predictor of future systems”
Auditing: We get a Mulligan. After training a system, we can check for misalignment, and only deploy if we're confident it's safe
- Not the most direct way. This ToI targets outer misalignment, the next one targets inner misalignment. But currently, people who are auditing for outer alignment do not use interpretability. They evaluate the model, they make the model speak and look if it is aligned with behavioral evaluations. Interpretability has not been useful in finding GPT’s jailbreaks.
- To date, I still don't see how we would proceed with interp to audit GPT-4.
Auditing for deception: Similar to auditing, we may be able to detect deception in a model. This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features - I see this more as a theory of change for 'worlds where interpretability is harder than I hope'.
- Critiqued in section “Auditing deception with interp is out of reach”
Enabling coordination/cooperation: If different actors can interpret each other's systems, it's much easier to trust other actors to behave sensibly and coordinate better
- Not the most direct way. If you really want coordination and cooperation, you need to help with AI governance and outreach of experts and researchers. The statement on AI risks has enabled more coordination than interp will probably ever get us.
Empirical evidence for/against threat models: We can look for empirical examples of theorized future threat models, eg inner misalignment
- Coordinating work on threat models: If we can find empirical examples of eg inner misalignment, it seems much easier to convince skeptics this is an issue, and maybe get more people to work on it.
  - Cicero or poker models are already capable of masking pieces of information or bluffing to play poker. From there, I don't know what it would mean to show canonical inner misalignment to non-technical people.
  - This focuses too much on deceptive alignment, and this will probably be too late if we get to this point.
- Coordinating a slowdown: If alignment is really hard, it seems much easier to coordinate caution/a slowdown of the field with eg empirical examples of models that seem aligned but are actually deceptive
  - Not the most direct way. This is a good theory of change, but interp is not the only way to show that a model is deceptive.
Improving human feedback: Rather than training models to just do the right things, we can train them to do the right things for the right reasons
- Seems very different from current interpretability work.
- Not the most direct way. Process-based training, model psychology, or other scalable oversight techniques not relying on interp may be more effective.
Informed oversight: We can improve recursive alignment schemes like IDA by having each step include checking the system is actually aligned. Note: This overlaps a lot with 7. To me, the distinction is that 7 can also be applied with systems trained non-recursively, eg today's systems trained with Reinforcement Learning from Human Feedback
- Yes, it's an improvement, but it's naive to think that the only problem with RLHF is just the issue of lack of transparency or deception. For example, we would still have agentic models (because agency is preferred by human preferences) and interpretability alone won't fix that. See the Compendium of problems with RLHF and Open Problems and Fundamental Limitations of RLHF for more details.
- Conceptual advances are more urgent. What does ‘checking the system is actually aligned’ really mean? It's not clear at all.
Interpretability tools in the loss function: We can directly put an interpretability tool into the training loop to ensure the system is doing things in an aligned way. Ambitious version - the tool is so good that it can't be Goodharted. Less ambitious - The tool could be Goodharted, but it's expensive, and this shifts the inductive biases to favor aligned cognition.
- Dual Use, for obvious reasons, and this one is particularly dangerous.
- List of lethalities 27. Selecting for undetectability: “Optimizing against an interpreted thought optimizes against interpretability.”
Norm setting: If interpretability is easier, there may be expectations that, before a company deploys a system, part of doing due diligence is interpreting the system and checking it does what you want
- Not the most direct way. Evals, evals, evals.
- No need to wait for interpretability. We already roughly know what to do. We could conduct studies in line with Evaluating Dangerous Capabilities and the paper Model Evaluation for Extreme Risks, Towards Best Practices in AGI Safety and Governance, this last paper presenting 50 statements about what AGI labs should do, none mentioning interp.
Enabling regulation: Regulators and policy-makers can create more effective regulations around how aligned AI systems must be if they/the companies can use tools to audit them
- Same critique as 10. Norm setting
Cultural shift 1: If the field of ML shifts towards having a better understanding of models, this may lead to a better understanding of failure cases and how to avoid them
- Not the most direct way. Technical outreach, communications, interviews, or even probably standards and benchmarks are way more direct.
Cultural shift 2: If the field expects better understanding of how models work, it'll become more glaringly obvious how little we understand right now
- Same critique as 12. Cultural shift 1.
- This is probably the opposite of what is happening now: People are fascinated by interpretability and continue to develop capabilities in large labs. I suspect that the well-known Distill journal has fascinated a lot of people, including people entering the field of ML, thus accelerating capabilities.
- See the False sense of control section.
Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is 'actually understanding how our black box systems work' not helpful?
- I don't know man, the worlds where we have limited resources, where we are funding constrained + Opportunity costs.
- Dual Use, refer to the section "Interpretability May Be Overall Harmful".
Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves
- Critique in section Microscope AI?.
Training AIs to interpret other AIs: Even if interpretability is really hard/labor intensive on advanced systems, if we can create aligned AIs near human level, we can give these interpretability tools and use them to interpret more powerful systems
- Object level: Training AIs to interpret other AIs could be useful but would already be dangerous, and we are already in classes of scenarios that are super dangerous.
- Meta level: This scheme is very speculative. I do not want the survival of civilization to rely on it. Godzilla strategy is probably not a good strategy (though this is controversial).
Forecasting discontinuities: By understanding what's going on, we can predict how likely we are to see discontinuities in alignment/capabilities, and potentially detect a discontinuity while training/before deploying a system
- Mostly the same critiques as in section “Interp is not a good predictor of future systems”
Intervening on training: By interpreting a system during training, we can notice misalignment early on, potentially before it's good enough for strategies to avoid our notice such as deceptive alignment, gradient hacking, obfuscating its thoughts, etc.
- Mostly the same critiques as in section “Auditing deception with interp is out of reach”
Auditing a training run: By checking for misalignment early in training, we can stop training systems that seem misaligned. This gives us many more shots to make an aligned system without spending large amounts of capital, and e.g. allows us to try multiple different schemes, initialisations, etc. This essentially shifts the distribution of systems towards alignment.
- Mostly the same critiques as in section “Auditing deception with interp is out of reach”
Eliciting Latent Knowledge: Use the length of the shortest interpretability explanation of behaviors of the model as a training loss for ELK - the idea is that models with shorter explanations are less likely to include human simulations / you can tell if they do. (credit to Tao Lin for this one)
- Same critique as 9. Interpretability tools in the loss function.
- Same critique as 15. Microscope AI.
- Same critique as 16. Training AIs to interpret other AIs.

Case study of some cool interp papers

This section is more technical.

Stephen Casper lists a bunch of impressive interpretability papers, as of February 2023. Let's try to investigate whether these papers could be used in the future to reduce risks. For each article, I mention the corresponding end story, and the critique of this end story applies to the article.

Bau et al. (2018)

Bau et al. (2018): Reverse engineer and repurpose a GAN for controllable image generation.

Procedure: (video) We generate images of churches using a GAN. There are often trees in the generated images. We manually outline the trees, then find the units in the GAN that are mostly responsible for generating these image regions. After finding these units, we ablate them, and it turns out that the trees disappear.
End Story: Enumerative safety
Useful for outer alignment? Ideally, we could 1. Find features which are undesirable 2. Then remove parts of the network that are most linked to these features. This is a very limited form of alignment procedure, by ablation.
- Maybe we could use this kind of procedure to filter pornography, but why then train the network on pornographic images in the first place?
- Basically, this is the same strategy as enumerative safety which is criticized above.
Useful for inner alignment? Can we apply this to deception? No, because by definition, deception will not result in a difference in outputs, so we cannot apply this procedure.
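The "find the units, then ablate them" procedure above can be sketched on toy data. This is only an illustration under stated assumptions: the overlap score, the `render` stand-in for the rest of the generator, and all names are hypothetical, and the paper's actual scoring and intervention are more involved.

```python
def unit_scores(activations, mask):
    """Score each unit by the fraction of its activation that falls inside the mask.

    activations: list of units, each a flat list of per-pixel activations.
    mask: flat list of 0/1 flags marking the hand-outlined region (the "trees").
    """
    scores = []
    for unit in activations:
        inside = sum(a for a, m in zip(unit, mask) if m)
        total = sum(unit) or 1.0  # avoid dividing by zero for silent units
        scores.append(inside / total)
    return scores


def ablate(activations, unit_ids):
    """Return a copy of the activations with the chosen units zeroed out."""
    return [[0.0] * len(u) if i in unit_ids else list(u)
            for i, u in enumerate(activations)]


def render(activations):
    """Toy 'image': per-pixel sum over units (hypothetical stand-in for later layers)."""
    return [sum(px) for px in zip(*activations)]


# Toy data: 4 pixels, the last two are "tree" pixels.
mask = [0, 0, 1, 1]
activations = [
    [1.0, 1.0, 0.0, 0.0],  # unit 0: fires on the background
    [0.0, 0.1, 2.0, 2.0],  # unit 1: fires mostly on the trees
    [0.5, 0.5, 0.5, 0.5],  # unit 2: fires everywhere
]

scores = unit_scores(activations, mask)
top_unit = max(range(len(scores)), key=scores.__getitem__)
before = render(activations)
after = render(ablate(activations, {top_unit}))
```

Note that the whole procedure starts from a visible region in the output (the mask). Without such a behavioral handle there is nothing to score the units against, which is the limitation discussed for deception.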

Ghorbani et al. (2020)

Ghorbani et al. (2020): Identify and successfully ablate neurons responsible for biases and adversarial vulnerabilities.

Procedure: (video) The method calculates the Shapley value of different units of a CNN and then removes the units with the highest Shapley value to maximize or minimize a metric. Removing certain units seems to make the network more robust to certain adversarial attacks.
End Story: Enumerative safety (and Reverse engineering?)
Useful for outer alignment? What would have happened if we had just added black women to the dataset? We can simply use a generative model for that and generate lots of images of black women. I'm almost certain that the technique used by OpenAI to remove biases in DALL-E 2 does not rely on interp.
Useful for inner alignment? Can we apply this to deception? No, again: the first step in using Shapley values with this interpretability method is to choose a metric to attribute, and no metric of deception currently exists. So we would first need to find a behavioral difference and some evidence of deception.
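To make the dependence on a metric concrete, here is a minimal sketch of a Shapley-style attribution over units. It uses plain permutation sampling, which is an assumption on my part and may differ from the estimator used in the paper; `WEIGHTS` and `toy_metric` are hypothetical.

```python
import random


def shapley_estimates(n_units, metric, n_permutations=200, seed=0):
    """Monte Carlo Shapley values: the average marginal contribution of each unit
    to `metric`, over random orderings in which units are switched on one by one."""
    rng = random.Random(seed)
    totals = [0.0] * n_units
    for _ in range(n_permutations):
        order = list(range(n_units))
        rng.shuffle(order)
        active = set()
        previous = metric(active)
        for unit in order:
            active.add(unit)
            current = metric(active)
            totals[unit] += current - previous
            previous = current
    return [t / n_permutations for t in totals]


# Hypothetical metric: a "bias" score to which each active unit adds a fixed amount.
WEIGHTS = [0.0, 0.5, 0.1, 2.0]


def toy_metric(active_units):
    return sum(WEIGHTS[u] for u in active_units)


values = shapley_estimates(len(WEIGHTS), toy_metric)
worst_unit = max(range(len(values)), key=values.__getitem__)  # candidate for removal
```

Everything here is driven by `metric`: with no scalar measure of the behavior to attribute (for example, a measure of deception), there is nothing to pass in and the method cannot start.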

Burns et al. (2022)

Burns et al. (2022): Identify directions in latent space that were predictive of a language model saying false things.

Procedure: Compare the probability of the ‘Yes’ token with the probability probed from the world model.
End Story: Microscope AI
Useful for inner alignment?
- Extracting knowledge from near GPT-3 level AIs, mostly trained through self-supervised learning via next token prediction, is a misunderstanding.
- This technique requires a minimum of agency and is not just usable as an oracle.
  - Chain-of-thought will probably always be better. Currently, this technique barely performs better than next token prediction. Chain-of-thought performs much better, and it seems we have (obvious) theoretical reasons to think so. So using GPTs as just an oracle won’t be competitive. This paper also doesn't test the trivial baseline of just fine-tuning the model (which has been found to usually work better).
  - Agency is probably required. It seems unlikely that the model will synthesize knowledge on its own in a world model during next-token prediction training. Making tests in the world, or reasoning in an open-ended way, is probably necessary to synthesize a proper truth feature in the world model of an advanced GPT using continual learning.
- Conclusion: Yes, maybe in the future, if we create autonomous agents that conduct experiments and have their own world model, this kind of technique could probably spot a mismatch between the world model oracle and what the model tells you. But if that were the case, we would probably already be in a very, very dangerous world. Civilization is not ready for this, and I still think that this method will be very brittle, and I prefer to aim for worlds where deception is unlikely. [section: Preventive measures]
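For readers who want the mechanics, here is a minimal sketch of the kind of objective I understand the paper to use: a probe is trained without labels on "Yes"/"No" contrast pairs, with a consistency term and a confidence term. The formula is written from memory and should be treated as an assumption; `mismatch` and its threshold are hypothetical helpers illustrating the comparison described in the procedure.

```python
def ccs_loss(p_yes, p_no):
    """Unsupervised probe loss on one contrast pair.

    p_yes: probe output for the statement completed with "Yes".
    p_no:  probe output for the same statement completed with "No".
    """
    consistency = (p_yes - (1.0 - p_no)) ** 2  # the two outputs should behave like p and 1 - p
    confidence = min(p_yes, p_no) ** 2         # penalize the degenerate 0.5 / 0.5 answer
    return consistency + confidence


def mismatch(p_token_yes, p_probe_yes, threshold=0.5):
    """Flag cases where what the model says and what the probe reads disagree."""
    return (p_token_yes > threshold) != (p_probe_yes > threshold)
```

For example, `ccs_loss(1.0, 0.0)` is `0.0` (consistent and confident), `ccs_loss(0.5, 0.5)` is `0.25` (consistent but uninformative), and `mismatch(0.9, 0.1)` is `True`.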

Casper et al. (2022)

Casper et al. (2022): Identify hundreds of interpretable copy/paste attacks.

Procedure: We try to find adversarial attacks automatically. For each image x, we optimize the latent space of a GAN to complete a patch in the image, to optimize the probability of classification towards class y2. We do this for each possible (x, y2) and we observe what appears in the patch. Then we search for natural patches that have a similar embedding to the synthetic adversarial patch, and if the similarity is high enough, we test the natural patch and see if this natural patch breaks the classifier.
End Story: Relaxed Adversarial training?
Useful for inner alignment? Conceptually yes, but this is so brittle. And if the trigger mechanism for deceptive alignment is not generable by the GAN or is not in the distribution of natural images/text, we won’t find it.
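The last step of the procedure (searching for natural patches whose embedding is close to the synthetic adversarial patch) can be sketched as a similarity search. The embeddings, patch names, and threshold below are hypothetical, and cosine similarity is my assumption for the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_natural_patches(synthetic_embedding, natural_embeddings, min_similarity=0.9):
    """Names of natural patches similar enough to the synthetic patch, most similar first.

    Each candidate still has to be pasted into images and tested against the classifier.
    """
    scored = [(cosine(synthetic_embedding, emb), name)
              for name, emb in natural_embeddings.items()]
    return [name for sim, name in sorted(scored, reverse=True) if sim >= min_similarity]


# Hypothetical 2-d embeddings.
synthetic = [1.0, 0.0]
natural = {
    "traffic_light": [0.9, 0.1],
    "grass": [0.0, 1.0],
    "stop_sign": [1.0, 0.0],
}
candidates = candidate_natural_patches(synthetic, natural)
```

The search can only return patches that exist in `natural_embeddings`, which mirrors the concern above: a trigger outside the distribution of natural images or text is never found.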

Ziegler et al. (2022)

Ziegler et al. (2022): Debug a model well enough to greatly reduce its rate of misclassification in a high-stakes type of setting.

Procedure: The project consists in building a highly reliable injury classifier: “We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable”. The aim here is to minimize the false negative rate, making the classifier as reliable as possible: we do not want to let any problematic sentence or example go unnoticed. In order to find adversarial examples, they experimented with the following techniques:
- 1. unaugmented humans: humans manually search for problematic sentences,
- 2. automatic paraphrasing: use of an LLM to find variations of problematic sentences and progressively enrich the dataset,
- 3. humans augmented with a rewriting tool. It is mainly this last technique that was useful.
End Story: Adversarial training and maybe Relaxed Adversarial training?
Robustness results:
- In-distribution: They started with a classifier that had a 2.5% false negative rate, and ended up with a classifier that had a 0.003% false negative rate! They are almost reaching the limits of detection for this type of dataset.
- Out-of-distribution: The tool-assisted attack now takes 26 minutes per adversarial example, which is only twice as long as the previous 13 minutes. So even though the original (in-distribution) dataset is now almost perfectly classified, it is still easy to create adversarial examples with automated tools: the defense improved by only a factor of 2, far short of the many orders of magnitude that the project aimed for.
- Conclusion: Thus, improving in-distribution performance by several orders of magnitude seems to have little impact out-of-distribution. Even though Redwood said that this project could have been conducted better, this is currently rather a negative result for improving adversarial robustness (i.e. out-of-distribution robustness). In addition, we can link these results to the paper Adversarial Policies Beat Superhuman Go AIs (Wang et al., 2022), which studies adversarial attacks on KataGo, an AI that is superhuman at the game of Go. They show that it is probably possible to find simple adversarial strategies even against very superhuman AIs. As a consequence, it seems that even for very robust and powerful AIs, it may always be possible to find adversarial attacks.
How to use this for deception? Unclear. Do we watch what the AI does, tell it 'you are the lord of the universe, you can do whatever you want' or more subtle variations of this prompt, and then see what happens? This is not really interpretability; it is much closer to prompt engineering with consistency checks, or cyborgism, than to interp. (Besides, the interpretability technique that they used is quite underwhelming: only coloring the tokens with the largest gradient of the norm of the token's embedding, and then basically iterating on synonyms of those important tokens.)
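The gap between the two robustness results is easier to see in orders of magnitude. This is a quick back-of-the-envelope computation using only the figures quoted above; the variable names are mine.

```python
import math

fn_before, fn_after = 2.5, 0.003      # in-distribution false negative rate, in percent
attack_before, attack_after = 13, 26  # minutes per tool-assisted adversarial example

in_dist_gain = fn_before / fn_after             # how many times fewer false negatives
attack_slowdown = attack_after / attack_before  # how many times slower the attack became

orders_in_dist = math.log10(in_dist_gain)    # roughly 2.9 orders of magnitude
orders_attack = math.log10(attack_slowdown)  # roughly 0.3 orders of magnitude
```

So the in-distribution improvement is close to three orders of magnitude, while the out-of-distribution defense gained about a third of one.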

1. Is feature visualization useful? Some findings suggest no: Red Teaming Deep Neural Networks with Feature Synthesis Tools.
2. GradCam: Maybe this paper? But this is still academic work.
3. I have organized two hackathons centered around the topic of spurious correlations. I strongly nudged using interp, but unfortunately, nobody used it... Yes, this claim is a bit weak, but it still indicates a real phenomenon; see [section Lack of real applications]
4. Note: I am not making any claims about ex-ante interp (also known as intrinsic interp), which so far has not been able to predict the future system either.
5. Other weaker difficulties for auditing deception with interp: This is already too risky and Prevention is better than cure. 1) Moloch may still kill us: “"auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead). […] a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?” [Source Rohin Shah]. 2) We probably won’t be competent enough to fix our mistake: “in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It's not like getting a coin to land Heads by flipping it again if it lands Tails - different AGI projects are not independent random variables, if you don't get good results the first time you won't get good results the next time unless you understand what happened. This means that auditing trained models isn't really appropriate for the middle of the skill curve.” [Source Charlie Steiner].
6. From “Conditioning Generative Models”: “Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”)?”
7. For example, what happens if you prompt a deceptive model with a joyous prompt, prompt the same deceptive model with a sad prompt, and then take the difference: do you obtain a joyous deceptive model?
8. But at the same time, we could be pessimistic, because this good idea has been out there in the wild since Christiano described it in 2019. So either this idea does not work and we have not heard about it, or the community has failed to recognize a pretty simple good idea.
9. Causal scrubbing could be a good way for evaluating interp techniques using something other than intuition. However, this is only suitable for localization assessment and does not measure how understandable the system is for humans.
10. “I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs and they came from the same lab in the past few months. It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance. But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and expensive. This might cause researchers to look for additional places for capabilities insights, and one of the obvious places to find such insights might be interpretability research.” from Peter Barnett.
11. Not quite! Hypotheses 4 (and 2?) are missing. Thanks to Diego Dorn for presenting this fun concept to me.
12. This excludes the governance hackathon, though; this is only from the technical ones. Source: Esben Kran.

91 comments, sorted by top scoring

[-]Richard_Ngo2yΩ5112415

Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).

Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers' projects are focusing on interpretability, people are probably overweighting it.

I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)

Perhaps the main problem

... (read more)

[-]Charbel-Raphaël2y5530

> This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs.

What type of reasoning do you think would be most appropriate?

This proves too much. The only way to determine whether a research direction is promising or not is through object-level arguments. I don't see how we can proceed without scrutinizing the agendas and listing the main difficulties.

> this by itself is sufficient to recommend it.

I don't think it's that simple. We have to weigh the good against the bad, and I'd like to see some object-level explanations for why the bad doesn't outweigh the good, and why the problem is sufficiently tractable.

> Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works;

Maybe. I would still argue that other research avenues are neglected in the community.

> not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better

I provided plenty of technical research direction in the "preventi... (read more)

5Richard_Ngo2y

See the discussion between me and interstice upthread for a type of argument that feels more productive. I agree (and mentioned so in my original comment). This post would have been far more productive if it had focused on exploring them. The things you should be looking for, when it comes to fundamental breakthroughs, are deep problems demonstrating fascinating phenomena, and especially cases where you can get rapid feedback from reality. That's what we've got here. If that's not object-level enough then your criterion would have ruled out almost all great science in the past. I wouldn't have criticized it so strongly if you hadn't listed it as "Perhaps the main problem I have with interp".

[-]Charbel-Raphaël2y1113

> This post would have been far more productive if it had focused on exploring them.

So the sections "Counteracting deception with only interp is not the only approach" and "Preventive measures against deception", "Cognitive Emulations" and "Technical Agendas with better ToI" don't feel productive? It seems to me that it's already a good list of neglected research agendas. So I don't understand.

> if you hadn't listed it as "Perhaps the main problem I have with interp"

In the above comment, I only agree with "we shouldn't do useful work, because then it will encourage other people to do bad things", and I don't agree with your critique of "Perhaps the main problem I have with interp..." which I think is not justified enough.

[-]Richard_Ngo2y1310

> So the sections "Counteracting deception with only interp is not the only approach" and "Preventive measures against deception", "Cognitive Emulations" and "Technical Agendas with better ToI" don't feel productive? It seems to me that it's already a good list of neglected research agendas. So I don't understand.

You've listed them, but you haven't really argued that they're valuable, you're mostly just asserting stuff like Rob Miles having a bigger impact than most interpretability researchers, or the best strategy being copying Dan Hendrycks. But since I disagree with the assertions, these sections aren't very useful; they don't actually zoom in on the positive case for these research directions.

(The main positive case I'm seeing seems to be "anything which helps with coordination is really valuable". And sure, coordination is great. But most coordination-related research is shallow: it helps us do things now, but doesn't help us figure out how to do things better in the long term. So I think you're overstating the case for it in general.)

[-]Charbel-Raphaël2y1210

I agree that I haven't argued the positive case for more governance/coordination work (and that's why I hope to do a next post on that).

We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-Risks could arrive in the near future. I'll be happy to reinvest in alignment work once we're sure we can avoid X-Risks from misuses and grossly negligent accidents.

[-]interstice2y3432

> Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works

If our goal is developing a principled understanding of deep learning, directly trying to do that is likely to be more effective than doing interpretability in the hope that we will develop a principled understanding as a side effect. For this reason I think most alignment researchers have too little awareness of various attempts in academia to develop "grand theories" of deep learning such as the neural tangent kernel. I think the ideal use for interpretability in this quest is as a way of investigating how the existing theories break down - e.g. if we can explain 80% of a given model's behavior with the NTK, what are the causes of the remaining 20%? I think of interpretability as basically collecting many interesting data points; this type of collection is essential, but it can be much more effective when it's guided by a provisional theory which tells you what points are expected and what are interesting anomalies which call for a revision of the theory, which in turn guides further exploration, etc.

[-]Richard_Ngo2y128

I agree that work like NTK is worth thinking about. But I disagree that it's a more "direct" approach to a principled understanding of deep learning. To find a "grand theory" of deep learning, we're going to need to connect our understanding of neural networks to our understanding of the real world, and I don't think NTKs or other related things can help very much with that step - for roughly the same reasons that statistical learning theory wasn't very helpful (and was in fact anti-helpful) in predicting the success of deep neural networks.

Btw, this isn't a general-purpose critique of theoretical work - e.g. it doesn't apply to this paper by Lin, Tegmark and Rolnick, which actually ties neural network success to properties of the real world like symmetry, locality, and compositionality. This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs.

> I think of interpretability as basically collecting many interesting data points

I'd agree if interpretability were just about "here's a circuit for recognizing X" (although even then, the concept of circuits itself was nontrivial to develop), but in fact a lot of the most promising work has been on more important and fundamental phenomena like superposition and induction heads.

[-]interstice2y11-2

> we're going to need to connect our understanding of neural networks to our understanding of the real world

The NTK and related theories aim to go from "SGD finds a giant blob of parameters that performs well on the data for some reason" to "SGD finds a solution with such-and-such clean mathematical characterization". To fully explain the success of deep learning you do then have to relate the clean mathematical characterization to the real world, but I think this can be done separately to some extent and is less of a bottleneck on progress. My #2 use case for interpretability would be doing stuff like this - basically conceptual/experimental investigation of the types of solutions favored by a given mathematical theory, with the goal of obtaining a high-level story about "why it works in the real world". Plus attempts to carry out alignment/interpretability/ELK tasks in the simplified setting.

> This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs

Hmm, it's been a while since I looked at this paper but if I recall it doesn't really try to make any specific predictions about the inductive bias of neural nets in practice, it's more like a series of suggestive analogies. That's fine, but I think that sort of thing is more likely to be productive if guided by a more detailed theory.

[-]Zach Furman2y169

I can't speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I'm significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic "structure" we find in trained models (both ML and biological!) and "structure" in the data generating process.

That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to "connect our understanding of neural networks to our understanding of the real world". This is the single most striking thing to come out of interpretability, in my opinion, and I'm worried about a "deep learning theory of everything" if it doesn't address this head on.

That said, NTK doesn't promise to be a theory of everything, so I don't mean to hold it to an unreasonable standard. It does what it says... (read more)

[-]interstice2y110

I intended my comment to apply to "theories of deep learning" in general, the NTK was only meant as an example. I agree that the NTK has problems such that it can at best be a 'provisional' grand theory. The big question is how to think about feature learning. At this point, though, there are a lot of contenders for "feature learning theories" - the Maximal Update Parameterization, Depth Corrections to the NTK, Perturbation Theory, Singular Learning Theory, Stochastic Collapse, SGD-Induced Sparsity....

So although I don't think the NTK can be a final answer, I do like the idea of studying it in more depth - it provides a feature-learning-free baseline against which we can compare actual neural networks and other potential 'grand theories'. Exactly which phenomena can we not explain with the NTK, and which theory best predicts them?

6Richard_Ngo2y

Strong upvote to Zach's comment, it basically encapsulates my view (except that I don't know what the "tractable-looking directions" he mentions are - Zach, can you elaborate?) I'd turn that around: is there any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs? (Not a rhetorical question: there may well be, I just haven't seen one.)

9Zach Furman2y

Yeah, I can expand on that - this is obviously going to be fairly opinionated, but there are a few things I'm excited about in this direction. The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn't have all the answers at the moment, but it's the single largest theory I've found that makes deep learning phenomena substantially "less surprising" (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection - the absence of bias-variance tradeoff, and the success of overparameterized models, aren't surprising when you internalize this. The "connection to real world structure" aspect hasn't been fully developed here, but it seems heavily suggested by the framework, in multiple ways - for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring's thesis). Outside of SLT, there's a few other areas I'm excited about - I'll highlight just one. You mentioned Lin, Tegmark, and Rolnick - the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can't be enough - for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I'd highlight Poggio et al. here, which is the paper in the literature closest to my current view on this. This isn't a complete list, eve

2interstice2y

FWIW most potential theories of deep learning are able to explain these, I don't think this distinguishes SLT particularly much.

1Zach Furman2y

Agreed - that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.) Another is that it can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, importance of parameter degeneracies, etc. More speculatively, but what excites me most, is that empirical phenomena like grokking, mode connectivity, and circuits seem to intuitively fit in SLT nicely, though this hasn’t been demonstrated rigorously yet.

3interstice2y

I don't think there are any. Of course much the same could be said of other deep learning theories and most (all?) interpretability work. The difference, as far as I can tell, is that there is a clear pathway to getting such explanations from the NTK: you'd want to do a spectral analysis of the sorts of functions learnable by transformer-NTKs. It's just that nobody has bothered to do this! That's why I think this line of research is neglected relative to interpretability or developing a new theoretical analysis of deep learning. Another obvious thing to try: NTKs often empirically perform comparably well to finite networks, but are usually a few percentage points worse in accuracy. Can we say anything about the examples where the NTK fails? Do they particularly depend on 'feature learning'? I think NTKs are a good complement to mechinterp in this regard, since they treat the weights at each neuron as independent of all others, so they provide a good indicator of exactly which examples may require interacting 'circuits' to be correctly classified.

2Noosphere8910mo

A note is that as it turns out, OthelloGPT learned a bag of heuristics, and there was no clean algorithm: https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1

2kave2y

What is the work that finds the algorithmic model of the game itself for Othello? I'm aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda's and Kenneth Li), but thought it was just about board state representations.

2Zach Furman2y

Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise - what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.

1kave2y

I think that means one of the following should be surprising from theoretical perspectives: 1. That the model learns a representation of the board state (or that a linear probe can recover it). 2. That the board state is used causally. Does that seem right to you? If so, which is the surprising claim? (I am not that informed on theoretical perspectives)

1Zach Furman2y

I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you've defined "learns a representation of the board state" correctly - otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model. This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It's also surprising to theories like NTK, mean-field, and tensor programs, because they view model activations as IID samples from a single-neuron probability distribution - but you can't reconstruct the board state via a permutation-invariant linear probe. The question of "which neuron is which" actually matters, so this form of feature learning is beyond them. (Though there may be e.g. perturbative modifications to these theories to allow this in a limited way).

2interstice2y

Permutation-invariance isn't the reason that this should be surprising. Yes, the NTK views neurons as being drawn from an IID distribution, but once they have been so drawn, you can linearly probe them as independent units. As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels. The reason the Othello result is surprising to the NTK is that neurons implementing an "Othello board state detector" would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
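The five-pixel thought experiment above can be made concrete. What follows is only a minimal sketch under assumptions that are not in the comment: each neuron's activation is taken to be exactly the value of the one pixel it is sensitive to, and the probe simply averages the neurons tied to a given pixel. The function names (`sample_neurons`, `probe_for_pixel`, and so on) are hypothetical.

```python
import random

def sample_neurons(n_neurons, n_pixels, rng):
    """Each neuron independently picks one pixel to be sensitive to (an IID draw)."""
    return [rng.randrange(n_pixels) for _ in range(n_neurons)]

def activations(assignment, x):
    """Assumption: a neuron's activation is just the value of its chosen pixel."""
    return [x[p] for p in assignment]

def probe_for_pixel(assignment, pixel):
    """Linear probe weights that average the neurons sensitive to `pixel`."""
    hits = [i for i, p in enumerate(assignment) if p == pixel]
    if not hits:
        raise ValueError(f"no sampled neuron is sensitive to pixel {pixel}")
    weight = 1.0 / len(hits)
    return [weight if p == pixel else 0.0 for p in assignment]

def apply_probe(weights, acts):
    """A linear probe is a dot product between probe weights and activations."""
    return sum(w * a for w, a in zip(weights, acts))

rng = random.Random(0)
assignment = sample_neurons(n_neurons=200, n_pixels=5, rng=rng)
x = [0.1, 0.7, -0.3, 0.9, 0.0]  # an arbitrary five-pixel input
acts = activations(assignment, x)
recovered = [apply_probe(probe_for_pixel(assignment, j), acts) for j in range(5)]
```

The distribution the neurons are drawn from treats all five pixels identically, yet once a particular set of neurons has been drawn, a per-pixel probe reads each pixel back out. This only illustrates the point about probing; it says nothing about whether a board-state detector would appear in a realistic initial distribution.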

1Zach Furman2y

Yeah, that's probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn't work anymore). Yeah, this is a good point. What I meant to specify wasn't that you can't recover any permutation-sensitive data at all (trivially, you can recover data about the input), but that any learned structures must be invariant to neuron permutation. (Though I'm feeling sketchy about the details of this claim). For the case of NTK, this is sort of trivial, since (as you pointed out) it doesn't really learn features anyway. By the way, there are actually two separate problems that come from the IID assumption: the "independent" part, and the "identically-distributed" part. For space I only really mentioned the second one. But even if you deal with the identically distributed assumption, the independence assumption still causes problems. This prevents a lot of structure from being representable - for example, a layer where "at most two neurons are activated on any input from some set" can't be represented with independently distributed neurons. More generally a lot of circuit-style constructions require this joint structure. IMO this is actually the more fundamental limitation, though it takes longer to dig into.
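A rough way to see the "at most two neurons active" point is a binomial calculation. This is a sketch under assumptions of my own rather than the formal argument: every neuron is taken to fire independently with the same probability `p` on a given input, and `prob_at_most_k_active` is a hypothetical helper name.

```python
from math import comb

def prob_at_most_k_active(n, p, k=2):
    """P(at most k of n neurons fire), if each fires independently with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

# With 100 neurons and an average of two active neurons per input (p = 0.02),
# the "at most two active" constraint is violated on a sizeable fraction of inputs.
violation_rate = 1.0 - prob_at_most_k_active(n=100, p=0.02)
```

Under these assumptions, with three or more neurons the constraint holds with certainty only when `p` is zero, that is, when nothing ever fires. Enforcing it for a layer that does fire requires the neurons to be jointly distributed, which is the structure the independence assumption rules out.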

2interstice2y

The same argument applies - if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution. This is a more reasonable objection (although actually, I'm not sure if independence does hold in the tensor programs framework - probably?).

1Zach Furman2y

Yeah, this "if" was the part I was claiming permutation invariance causes problems for - that identically distributed neurons probably couldn't express something as complicated as a board-state-detector. As soon as that's true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance. I probably should've just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc). My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.

1kave2y

Thanks! The permutation-invariance of a bunch of theories is a helpful concept

[-]scasper2y2726

I get the impression of a certain kind of motte-and-bailey in this comment and similar arguments. From a high level, the notion of better understanding what neural networks are doing would be great. The problem though seems to be that most of the SOTA of research in interpretability does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.

[-]Richard_Ngo2y3517

I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there's a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like "I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception"). And in general a lot of this is of the form "I don't see how X", which is the format I'm objecting to, because of course you won't see how X until someone invents a technique to X.

This is exacerbated by the meta-level problem that people have very different standards for what's useful (e.g. to Eliezer, none of this is useful), and also standards for what types of evidence and argument they accept (e.g. to many ML researchers, approximately all arguments about long-term theories of impact are too speculative to be worth engaging in depth).

I still think that so many people are working on interpretability mainly because they don't see alternatives that are as promising; in general I'd welcome writing that clearly lays out solid explanations and intuitions about why those other research directions are worth working on, and think that this would be the best way to recalibrate the field.

5scasper2y

Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones. Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of the biggest things that I think is a concern though is that people seem to have been making similar takes with little change for 7+ years. But I just don't think there have been a number of wins from this research that are commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.

[-]Charlie Steiner2y*104

EDIT: Nuance of course being impossible, this no doubt comes off as rude - and is in turn a reaction to an internet-distorted version of what you actually wrote. Oh well, grain of salt and all that.

The way you get safety by design is understanding what's going on inside the neural networks.

This is equivocation. There are some properties of what's going on inside a NN that are crucial to reasoning about its safety properties, and many, many more that are irrelevant.

I'm actually strongly reminded of a recent comment about LK-99, where someone remarked that a good way to ramp up production of superconductors would be to understand how superconductors work, because then we could design one that's easier to mass-produce.

Except:

What we normally think of as "understanding how superconductors work" is not a sure thing, it's hard and sometimes we don't find satisfactory models.
Even if we understand how superconductors work, designing new ones with economically useful properties is an independent problem that's also hard and possible to fail at for decades.
There are many other ways to make progress in discovering superconductors and ramping up their production. These ways are sometimes purel

... (read more)

9dr_s2y

The tricky part being that in the AGI alignment discourse, if you believe in self-improvement runaway feedback loops, there is no good. There is only perfect, or extinction. This might be a bit extreme but we don't really know that for sure either.

4the gears to ascension2y

Note that a wrench current paradigms throw in this is that self-improvement processes would not look uniquely recursive, since all training algorithms sort of look like "recursive self improvement". Instead, RSI is effectively just "oh no, the training curve was curved differently on this training run", which is something most likely to happen in open world RL. But I agree, open world RL has the ability to be suddenly surprising in capability growth, and there wouldn't be much of an opportunity to notice the problem unless we've already solved how to intentionally bound capabilities in RL. There has been some interesting work on bounding capability growth in safe RL already, though. I haven't looked closely at it, I wonder if any of it is particularly good. Edit: note that I am in fact claiming that after MIRI deconfuses us, it'll turn out to apply to ordinary gradient updates.

1Charlie Steiner2y

Au contraire, the perfect future doesn't exist, but good ones do.

5dr_s2y

This isn't about "perfect futures" though, but about perfect AGIs specifically. Consider a future that goes like this: 1. the AI's presence and influence over us evolves exponentially according to a law $\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI}$, 2. the exponent $\gamma$ expresses the amount of misalignment; if the AI is aligned and fully under our control, $\gamma = 0$, otherwise $\gamma > 0$, then in that future, anything less than perfect alignment ends with us overwhelmed by the AI, sooner or later. This is super simplistic, but the essence is that if you keep around something really powerful that might just decide to kill you, you probably want to be damn sure it won't. That's what "perfect" here means; it's not fine if it just wants to kill you a little bit. So if your logic is correct (and indeed, I do agree with you on general matters of ethics), then perhaps we just shouldn't build AGI at all, because we can't get it perfect, and if it's not perfect it'll probably be in too precarious a balance with us for it to persist for long.
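The toy model in the comment above (influence growing at a rate proportional to itself, with rate γ) has the standard closed-form solution AI(t) = AI(0)·exp(γt). The sketch below just evaluates it; the threshold, the starting value of 1.0, and the function names are illustrative assumptions, not part of the comment.

```python
import math

def influence(t, gamma, initial=1.0):
    """Solution of dAI/dt = gamma * AI with AI(0) = initial."""
    return initial * math.exp(gamma * t)

def time_to_reach(threshold, gamma, initial=1.0):
    """Time at which influence first reaches `threshold` (math.inf if it never does)."""
    if threshold <= initial:
        return 0.0
    if gamma <= 0:
        return math.inf
    return math.log(threshold / initial) / gamma

# A tenfold smaller gamma delays the crossing tenfold, but only gamma = 0 avoids it.
slow = time_to_reach(threshold=100.0, gamma=0.01)
fast = time_to_reach(threshold=100.0, gamma=0.1)
never = time_to_reach(threshold=100.0, gamma=0.0)
```

Within this model, any positive γ crosses any fixed threshold in finite time, which is the sense in which "anything less than perfect" loses eventually. Whether real systems behave like this model is exactly what the replies dispute.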

2Charlie Steiner2y

Ah, I see more of what you mean. I agree an AI's influence being small is unstable. And this means that the chance of death by AI being small is also unstable. But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn't, you'll probably live a long and healthy life (because of arguments like stability of value being a convergent instrumental goal). It's not that once an AI becomes high-influence, there's an exponential decay of humans, as every day it makes a new random mutation to its motivations.

2dr_s2y

I don't think that's necessarily true. There's two ways in which I think it can compound: 1. if the AGI will self-upgrade, or design more advanced AGI, the problem repeats, and the AGI can make mistakes, same as us, though probably less obvious mistakes 2. it is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered on some runaway loop in which it loses its alignment. Like it will come up with pretty aligned solutions most of the time but there is something, some kind of problem or situation, that is so out-of-domain it sends it off the path of insanity, and it's unrecoverable, and we don't know how or when that might occur. Also, it might simply be probabilistic - any non-fully deterministic AGI probably wouldn't literally have no access to non-aligned strategies, but merely assign them very small logits. So in theory that's still a finite but non-zero possibility that it goes into some kind of "kill all humans" strategy path. And even if you interpret this as one-shot (did you align it right or not on creation?), the effects might not be visible right away.
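The "finite but non-zero possibility" point can be illustrated with a short calculation. This sketch assumes something the comment does not state: that each decision is an independent trial with the same small probability `p` of entering an unrecoverable path. `prob_any_failure` is a hypothetical helper name.

```python
def prob_any_failure(p, n):
    """P(at least one failure in n independent trials, each failing with probability p)."""
    return 1.0 - (1.0 - p) ** n

# A one-in-a-million per-decision risk, repeated over a million decisions.
cumulative = prob_any_failure(p=1e-6, n=10**6)
```

Under the independence assumption, a per-decision risk that looks negligible accumulates to roughly 1 - 1/e over a million decisions. If the risk is instead a one-time property fixed at creation, as the parent comment suggests, this compounding does not apply.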

0Noosphere892y

Now that I think about it, this is the main problem a lot of LW thinking and posting has: It implicitly thinks that only a perfect, watertight solution to alignment is sufficient to guarantee human survival, despite the fact that most solutions to problems don't have to be perfect to work, and even in the cases where we do face an adversary, imperfect but fast solutions win out over perfect, very slow solutions, and in particular ignores that multiple solutions to alignment can fundamentally stack. In general, I feel like the biggest flaw of LW is its perfectionism, and the big reason why Michael Nielsen pointed out that alignment is extremely accelerationist in practice is that OpenAI implements a truth that LWers like Nate Soares and Eliezer Yudkowsky, as well as the broader community, don't: Alignment approaches don't need to be perfect to work, and having an imperfect safety and alignment plan is much better than no plan at all. Links are below: https://www.lesswrong.com/posts/8Q7JwFyC8hqYYmCkC/link-post-michael-nielsen-s-notes-on-existential-risk-from https://www.beren.io/2023-02-19-The-solution-to-alignment-is-many-not-one/

[-]quetzal_rainbow2y122

It's literally point -2 in List of Lethalities that we don't need "perfect" alignment solution, we just don't have any.

2Noosphere892y

I basically just disagree with this entirely, unless you don't count stuff like RLHF or DPO as alignment. More generally, if we grant that we don't need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it's often doable to solve problems if you don't need perfect or arbitrarily good solutions. So I basically just disagree with Eliezer here.

2abramdemski2y

It seems really easy to spell out worldviews where "we don't need perfection, or arbitrarily good alignment" but yet "alignment should be really easy". To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle -- so if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors). But I might also think that properly "checking" even one feature is really, really hard. So I don't buy the claimed implication: "we don't need perfection" does not imply "alignment should be really easy". Indeed, I think the implication quite badly fails.
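The "something like a 95% safety guarantee (depending on priors)" remark can be worked through with Bayes' rule. The sketch below rests on simplifying assumptions that are mine, not the comment's: the model either has no unsafe feature or exactly one, checks are perfectly reliable, and the checked features are a uniformly random fraction of all features. `posterior_safe` is a hypothetical name.

```python
def posterior_safe(prior_unsafe, fraction_checked):
    """P(model is safe | every checked feature looked safe).

    Assumes at most one unsafe feature, perfect checks, and uniform random sampling.
    """
    # Probability that an unsafe feature exists but landed in the unchecked part.
    missed = prior_unsafe * (1.0 - fraction_checked)
    # Probability that there was nothing unsafe to find in the first place.
    clean = 1.0 - prior_unsafe
    if clean + missed == 0.0:
        raise ValueError("a clean check is impossible under these inputs")
    return clean / (clean + missed)

# With a 50/50 prior, checking 95% of features and finding nothing
# leaves a bit over 95% confidence that the model is safe.
even_prior = posterior_safe(prior_unsafe=0.5, fraction_checked=0.95)
pessimistic = posterior_safe(prior_unsafe=0.99, fraction_checked=0.95)
```

The pessimistic case shows the dependence on priors: starting from 99% belief that an unsafe feature exists, the same clean 95% check leaves far less than 95% confidence. None of this addresses the comment's actual point, which is that checking even one feature properly may be very hard.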

2Noosphere892y

I'll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it's still impossible to solve the problem, but it's usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints. I agree it isn't a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won't have this failure mode, so I'm still quite comfortable with using it as an implication that isn't 100% accurate, but more like 90-95+% accurate.

2abramdemski2y

I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand. I agree the example sucks and only serves to prove that it is not a logical implication. A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not make an assumption that we need to prove 100% safety, but rather, argues, for reasons, from assumptions that it will be hard to get any safety at all from loss functions which merely coincide with what we want somewhat well. I still think the list of lethalities is a pretty good reply to your overall line of reasoning -- IE it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, IE provides arguments rather than some kind of blind assumption like you seem to be indicating. You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but "things tend to be easy when you lift the requirement of perfection" is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.

1quetzal_rainbow2y

The problem with RLHF/DPO is not that it doesn't work period, the problem is that we don't know if they work. I can imagine that we can just scale to superintelligence, apply RLHF and get aligned ASI, but this would imply a bunch of things about reality like "even at high level of capability reasonable RLHF-data contains overwhelmingly mostly good value-shaped thought-patterns" and I just don't think that we know enough about reality to make such statements.

4Noosphere892y

I think this might be a crux, actually. I think it's surprisingly common in history for things to work out well empirically, but that we either don't understand how they work, or it took a long time to understand how it works. AI development is the most central example, but I'd argue the invention of steel is another good example. To put it another way, I'm relying on the fact that there have been empirically successful interventions where we either simply don't know why it works, or it takes a long time to get a useful theory out of the empirically successful intervention.

0Remmelt2y

Are you mostly looking for where there is useful empirical feedback? That sounds like a shot in the dark. A concern I have: I cannot conceptually distinguish these continued empirical investigations of methods to build maybe-aligned AGI, from how medieval researchers tried to build perpetual motion machines. It took sound theory to finally disprove, once and for all, the possibility of perpetual motion machines. I agree with Charbel-Raphaël that the push for mechanistic interpretability is in effect promoting the notion that there must be possibilities available here to control potentially very dangerous AIs to stay safe in deployment. It is much easier to spread the perception of safety, than to actually make such systems safe. All this while there is no sound theoretical basis for claiming that scaling mechanistic interpretability could form the basis of such a control method, nor that any control method could keep "AGI" safe. Rather, mechint is fundamentally limited in the extent it could be used to safely control AGI. See posts: 1. The limited upside of interpretability by Peter S. Park 2. Why mechanistic interpretability does not and cannot contribute to long-term AGI safety by me. Besides theoretical limits, there are plenty of practical arguments (as listed in Charbel-Raphaël's post) for why scaling the utilisation of mechint would be net harmful. So there is no rigorous basis for the claim that the use of mechint would "open up possibilities" to long-term safety, and there are plenty of possibilities for corporate marketers to chime in on mechint's hypothetical big breakthroughs. In practice, we may again, accidentally, help AI labs to safety-wash their AI products.

8Richard_Ngo2y

It does seem like a large proportion of disagreements in this space can be explained by how hard people think alignment will be. It seems like your view is actually more pessimistic about the difficulty of alignment than Eliezer's, because he at least thinks it's possible for mechinterp to help in principle. I think that being confident in this level of pessimism is wildly miscalibrated, and such a big disagreement that it's probably not worth discussing much further. Though I reply indirectly to your point here.

1Remmelt2y

I personally think pessimistic vs. optimistic misframes it, because it frames a question about the world in terms of personal predispositions. I would like to see reasoning. Your reasoning in the comment thread you linked to is: “history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems” That’s a broad reference-class analogy to use. I think it holds little to no weight as to whether there would be sufficient progress on the specific problem of “AGI” staying safe over the long-term. I wrote why that specifically would not be a solvable problem.

[-]Charbel-Raphaël7mo*Ω37808Review for 2023 Review

Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it.

I still want everyone working on interpretability to read it and engage with its arguments.

Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.

Updates on my views

Legend:

On the left of the arrow, a citation from the OP → ❓ on the right, my review which generally begins with emojis
✅ - yes, I think I was correct (>90%)
❓✅ - I

... (read more)

[-]ryan_greenblatt2y*Ω276354

After spending a while thinking about interpretability, my current stance is:

Let's define Mechanistic interpretability as "A subfield of interpretability that uses bottom-up approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
I think mechanistic interpretability probably has to succeed very ambitiously to be useful.
Mechanistic interpretability seems to me to be very far from succeeding this ambitiously.
Most people working on mechanistic interpretability don't seem to me like they're on a straightforward path to ambitious success, though I'm somewhat on board with the stuff that Anthropic's interp team is doing here.

Note that this is just for "mechanistic interpretability". I think that high level top down interpretability (both black box and white box) has a clearer story for usefulness which doesn't require very ambitious success.

[-]ryan_greenblatt2y*Ω153422

For mechanistic interpretability, very ambitious success looks something like:

Have some decomposition of the model or the behavior of the model into parts.
For any given randomly selected part, you should almost always be able to build up a very good understanding of this part in isolation.
- By "very good" I mean that the understanding accounts for 90% of the bits of optimization applied to this part (where the remaining bits aren't predictably more or less important per bit than what you've understood).
- Roughly speaking, if your understanding accounts for 90% of the bits of optimization for an AI, then it means you should be able to construct an AI which works as well as if the original AI was only trained with 90% of the actual training compute.
- In terms of loss explained, this is probably very high, like well above 99%.
The length of the explanation of all parts is probably only up to 1000 times shorter in bits than the size of the model. So, for a 1 trillion parameter model it's at least 100 million words or 200,000 pages (assuming 10 bits per word). The compression comes from being able to use human concepts, but this will only get you so much.
Given your ability to explain any giv

... (read more)
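The word and page counts in the comment above follow from a short calculation. The sketch below reproduces them; note that `bits_per_param=1` and `words_per_page=500` are my assumptions, back-solved so that the output matches the stated figures (the comment only gives the 1000x compression bound and 10 bits per word). `explanation_size` is a hypothetical name.

```python
def explanation_size(n_params, bits_per_param=1, compression=1000,
                     bits_per_word=10, words_per_page=500):
    """Return (words, pages) for an explanation `compression` times shorter than the model.

    bits_per_param and words_per_page are assumed values, not taken from the comment.
    """
    model_bits = n_params * bits_per_param
    explanation_bits = model_bits / compression
    words = explanation_bits / bits_per_word
    pages = words / words_per_page
    return words, pages

# A 1 trillion parameter model under the assumptions above.
words, pages = explanation_size(n_params=10**12)
```

Under these assumptions the result is 100 million words and 200,000 pages. Counting more than one bit per parameter only makes the explanation longer, which fits the comment's "at least".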

[-]ryan_greenblatt2yΩ142621

The main reason why I think mechanistic interpretability is very far from ambitious success is that current numbers are extremely bad and what people explain is extremely cherry picked. Like people's explanations typically result in performance which is worse than that of much, much tinier models even though heavy cherry picking is applied.

If people were getting ok perf on randomly selected "parts" of models (for any notion of decomposition), then we'd be much closer. I'd think we were much closer even if this was extremely labor intensive.

(E.g., the curve detectors work explained ~50% of the loss which is probably well less than 10% of the bits given sharply diminishing returns to scale on typical scaling laws.)

[-]Rohin Shah2y*Ω213821

I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.

I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:

It is possible to automatically make and verify claims about what topics a model is internally "thinking about" when answering a question. This is integrated into debate, and allows debaters to critique each other's internal reasoning, not just the arguments they externally make.
- (It's unclear how much this buys you on top of cross-examination.)
It is possible to automatically identify "cruxes" for the model's outputs, making it easier for adversaries to design situations that flip the crux without flipping the overall correct decision.
- Redwood's adversarial training project is roughly in this category, where the

... (read more)

[-]Charbel-Raphaël2y121

To give props to your last paragraphs, you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, AI governance is comparatively neglected, and I'm not sure that's the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target to train my writing.

I hope to work on a more constructive post, detailing constructive strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope that such a post would be the ideal place for more constructive conversations, although I doubt that I am the best suited person to write it.

3Charbel-Raphaël2y

Makes sense, but I think this is not the optimal resource allocation. I explain why below: Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress, otherwise smart scaling and regulations are my priorities). My primary goal remains coordination and safety culture. Mainly, I believe that one of the main pivotal processes goes through governance and coordination. A quote that explains my reasoning well is the following: * "That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post)." That's why I really appreciate Dan Hendrycks' work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance. We've talked a bit during the EAG, and I understood that there's something like a numerus clausus in DeepMind's safety team. In that case, since interpretability doesn't require a lot of computing power/prestige and as DeepMind has a very high level of prestige, you should use it to write papers that help with coordination. Interpretability could be done outside the labs. For example, some of your works like Model evaluation for extreme risks, or Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, are great for such purpose!

2MiguelDev2y

I agree with this perspective if we can afford the time to perform interpretability work on all model setups, which our head count is too low to do. Given the urgency to address the alignment challenge quickly, it's better to encourage (or even prioritize) conceptually sound interpretability work rather than speculative approaches.

[-]evhub2y*361

I think that this is a well-done post overall, though I mostly disagree with it. A couple of thoughts below.

First, I was surprised not to see unknown unknowns addressed, as Richard pointed out.

Second, another theory of impact that I didn't see addressed here is the case that I've been trying to make recently that interpretability is likely to be necessary to build good safety evaluations. This could be quite important if evaluations end up being the primary AI governance tool, as currently looks somewhat likely to me.

Third, though you quote me talking about why I think detecting/disincentivizing deception with interpretability tools is so hard, what is not quoted is what I think about the various non-interpretability methods of doing so—and what I think there is that they're even harder. Though you mention a bunch of non-interpretability ways of studying deception (which I'm definitely all for), studying it doesn't imply that we can disincentivize it (and I think we're going to need both). You mention chain-of-thought oversight as a possible solution, but I'm quite skeptical of that working, simply because the model need not write out its deception in the scratchpad in any legible ... (read more)

[-]leogao2y3616

My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans. I think just getting a sense of what even these models are implementing internally could help a lot with deconfusion here. I don't think it's strictly necessary to do interpretability as opposed to targeted experiments where we observe external behaviour for these kinds of things, but probably experiments that get many bits are much better than targeted experiments for deconfusion, because oftentimes the hypotheses are all wrong in subtle ways. Aside from that, I am not optimistic about fully understanding the model, training against interpretability, microscope AI, or finding the "deception neuron" as a way to audit deception. I don't think future models will necessarily have internal structures analogous to current models.

2Charbel-Raphaël2y

I completely agree that past interp research has been useful for my understanding of deep learning. But we are funding-constrained. The question now is "what is the marginal benefit of one hour of interp research compared to other types of research", and "whether we should continue to prioritize it given our current understanding and the lessons we have learned".

6leogao2y

I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you're describing. I think it's important that research effort is not fungible. Interpretability has a pretty big advantage that unlike conceptual work, a) it has tight feedback loops, b) is much more paradigmatic, c) is much easier to get into for people with an ML research background. Plausibly the most taut constraint in research is not strictly the number of researchers you can fund/train to solve a given problem--it's hard to get researchers to do good work if they don't feel intellectually excited about the problem, which in turn is less likely if they feel like they're never making any progress, or feel like they are constantly unsure about what problem they're even trying to solve. To be clear I am not arguing that we should focus on things that are easier to solve--I am very much in favor of not just doing things that are easy to do but actually don't help ("looking under the streetlamp"). Rather, I think what we should be doing is finding things that actually matter and making it easier for people to get excited about them (and people who are able to do this kind of work have a huge comparative advantage here!).

2Richard_Ngo2y

This seems like a false dichotomy; in general I expect that the best conceptual work will be done in close conjunction with interpretability work or other empirical work. (In general I think that almost all attempts to do "conceptual" work that don't involve either empirical results or proofs are pretty doomed. I'd be interested in any counterexamples you've seen; my main counterexample is threat modeling, which is why I've been focusing a lot on that lately.) EDIT: many downvotes, no counterexamples. Please provide some.

3leogao2y

I agree that doing conceptual work in conjunction with empirical work is good. I don't know if I agree that pure conceptual work is completely doomed but I'm at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on conceptual work. 1. They may find some other avenue of empirical work that can help with alignment. I think probably there exist empirical avenues substantially more valuable for alignment than making progress on interpretability and opening those up requires thinking about the conceptual side. 2. Even if they think hard about it and can't think of anything better than conceptual+interpretability, it still seems better for an interpretability researcher to have an idea of how their work will fit into the broader picture. Even if they aren't backchaining, this still seems more useful than just randomly doing something under the heading of interpretability.

4Richard_Ngo2y

(I assume that the last "conceptual" should be "empirical".) I agree that not thinking about the conceptual side is bad. But that's standard for science. Like, top scientists in almost any domain aren't just thinking about their day-to-day empirical research, they have broader opinions about the field as a whole, and more speculative and philosophical ideas, and so on. The difference is whether they treat those ideas as outputs in their own right, versus as inputs that feed into some empirical or theoretical output. Most scientists do the latter; when people in alignment talk about "conceptual work" my impression is that they're typically thinking about the former.

1Andrew McKnight9mo

Do you think putting extra effort into learning about existing empirical work while doing conceptual work would be sufficient for good conceptual work or do you think people need to be producing empirical work themselves to really make progress conceptually?

5Richard_Ngo9mo

The former can be sufficient—e.g. there are good theoretical researchers who have never done empirical work themselves. In hindsight I think "close conjunction" was too strong—it's more about picking up the ontologies and key insights from empirical work, which can be possible without following it very closely.

[-]CallumMcDougall2y*201

(context: I ran the most recent iteration of ARENA, and after this I joined Neel Nanda's mech interp stream in SERI MATS)

Registering a strong pushback to the comment on ARENA. The primary purpose of capstone projects isn't to turn people into AI safety technical researchers or to produce impressive capstones, it's to give people engineering skills & experience working on group projects. The initial idea was not to even push for things that were safety-specific (much like Redwood's recommendations - all of the suggested MLAB2 capstones were either mech interp or non-safety, iirc). The reason many people gravitated towards mech interp is that they spent a lot of time around researchers and people who were doing interesting work in mech interp, and it seemed like a good fit for both getting a feel for AI safety technical research and for general skilling up in engineering.

Additionally, I want to mention that participant responses to the question "how have your views on AI safety changed?" included both positive and negative updates on mech interp, but much more uniformly showed positive updates on AI safety technical research as a whole. Evidence like this updates me away from the... (read more)

[-]dr_s2y118

The biggest thing that worries me about the idea of interpretability, which you mention, is that any sufficiently low-level interpretation of a giant, intractably complex AGI-level model would likely also be intractably complex. And any interpretation of that. And so on and so forth, until you start getting the feeling that you'll probably need AI to interpret the interpretation, and then AI to interpret the interpreter, and so on in a chain which you might try to carefully validate but that increasingly feels like a typical Godzilla Strategy. This does not lead to rising property values in Tokyo.

That said, maybe it can be done, and even be reliable enough. But it would also significantly enhance our ability to distil models. Like, if you could take a NN-based model, interpret it, and map it to a GOFAI-style extremely interpretable system, now you probably have a much faster, leaner and cleaner version of the same AI - so you can probably just build an even bigger AI. And the question then becomes whether this style of interpretability can ever catch up to the increase in capabilities it would automatically foster.

[-]Daniel Murfet2y116

Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.

This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!

For many, the argument for AI safety being an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.

I view Olsson et al as nontrivial evidence for the emergence of internal computations resembling reasoning, with increasing scale. That's profound. If that case is made stronger over time by interpretability (as I expect it to be) the scientific, philosophical and societal impact will be immense.
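
To make the "[A][B] ... [A] → [B]" behaviour attributed to induction heads concrete, here is a minimal sketch of the input-output rule only. It is a toy, not a description of how a transformer implements it; the function name `induction_predict` and the example tokens are assumptions made for illustration.

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict whatever token followed it. Returns None when
    the last token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over the earlier positions (excluding the last one).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# Hypothetical example sequence, chosen only for illustration.
print(induction_predict(["Mr", "Dursley", "said", "Mr"]))
```

Read as "A was followed by B earlier; A occurs again; therefore predict B", the rule has the if-then shape that the modus ponens comparison points at, though whether that amounts to reasoning is exactly what is under discussion.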

[-]Stephen McAleese2y10-9

In my opinion, much of the value of interpretability is not related to AI alignment but to AI capabilities evaluations instead.

For example, the Othello paper shows that a transformer trained on next-token prediction of Othello moves learns a world model of the board rather than just statistics of the training text. This knowledge is useful because it suggests that transformer language models are more capable than they might initially seem.

[-]Jan Betley2y70

Very good post! I agree with most of what you have written, but I'm not sure about the conclusions. Two main reasons:

I'm not sure if mech interp should be compared to astronomy, I'd say it is more like mechanical engineering. We have JWST because a long, long time ago there were watchmakers, gunsmiths, opticians, etc. who didn't care at all about astronomy, yet their advances in unrelated fields made astronomy possible. I think something similar might happen with mech interp - we'll keep creating better and better tools to achieve some goals, these goals will

... (read more)

[-]MiguelDev2y7-7

Fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. Met smart people stuck in aimless exploration; bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.

[-]Cole Wyeth2y60

I roughly agree with the case made here because I expect interpretability research to be much, much harder than others seem to appreciate. This is a consequence of strong intuitions from working on circuit complexity. Figuring out the behavior of a general circuit sounds like it's in a very hard complexity class - even writing down the truth table for a circuit takes exponential time in the number of inputs! I would be surprised if coming up with a human interpretable explanation of sub circuits is easy; there are some reasons to believe that SGD wil... (read more)
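
The truth-table point can be checked with a small sketch: enumerating every input assignment of an n-input Boolean circuit takes 2^n evaluations. The function names and the 3-input circuit below are hypothetical, chosen only to illustrate the growth.

```python
from itertools import product


def truth_table(circuit, n_inputs):
    """Evaluate `circuit` on every assignment of n_inputs Boolean values.
    The resulting table has 2**n_inputs rows."""
    return [(bits, circuit(*bits))
            for bits in product((False, True), repeat=n_inputs)]


# Hypothetical 3-input circuit, used only as an example.
def example_circuit(a, b, c):
    return (a and b) ^ c


table = truth_table(example_circuit, 3)
print(len(table))  # number of rows for 3 inputs

# Row counts grow exponentially with the number of inputs.
for n in (10, 20, 30):
    print(n, 2 ** n)
```

This only illustrates the brute-force cost of writing the table down; it says nothing about how hard it is to find a compact human-readable explanation, which is the harder question raised above.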

[-]Quadratic Reciprocity2y60

see the current plan here EAG 2023 Bay Area The current alignment plan, and how we might improve it

Link to talk above doesn't seem to work for me.

Outside view: The proportion of junior researchers doing interp rather than other technical work is too high

Quite tangential^[1] to your post but if true, I'm curious about what this suggests about the dynamics of field-building in AI safety.

Seems to me like certain organisations and individuals have an outsized influence in funneling new entrants into specific areas, and because the field is small (and ... (read more)

[-]Daniel Murfet7mo50Review for 2023 Review

I have been thinking about interpretability for neural networks seriously since mid-2023. The biggest early influences on me that I recall were Olah's writings and a podcast that Nanda did. The third most important is perhaps this post, which I valued as an opposing opinion to help sharpen up my views.

I'm not sure it has aged well, in the sense that it's no longer clear to me I would direct someone to read this in 2025. I disagree with many of the object level claims. However, especially when some of the core mechanistic interpretability work is not being subjected to peer review, perhaps I wish there was more sceptical writing like this on balance.

[-]abramdemski2yΩ358

I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.

It mostly didn't.

A lot of this boils down to "existing interpretability work is unimpressive". I think this is an important point, and significant sub-points were raised to argue it. However, it says little 'against almost every theory of impact of interpretability'. We can just do better work.
A lot of the rest boils down to "enumerative safety is dumb". I agree, at least for the

... (read more)

[-]Charlie Steiner2yΩ350

I broadly agree, but I think there's more safety research along with "Retarget the search" that focuses on using a trained AI's own internals to understand things like deception, planning, preferences, etc, that you didn't mention. You did say this sort of thing isn't a central example of "interpretability," which I agree with, but some more typical sorts of interpretability can be clear instrumental goals for this.

E.g. suppose you want to use an AI's model of human preferences for some reason. To operationalize this, given a description of a situation, yo... (read more)

[-]Algon2y53

I thought the section on interpretability as a tool to predict future systems was poor. The post's arguments against that theory of impact are: reading current papers is a better predictor of future capabilities than current interpretability work & examples of interpretability being applied after phenomena are discovered. But no one is saying current interpretability tech & insights will let you predict the future! As you point out, we barely even understand what a feature is!

Which could change. If we advance enough to reverse engineer GPT-4, and f... (read more)

[-]Review Bot1y*30

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Mateusz Bagiński2y31

A feature is still a fuzzy concept,

"Gene", "species", and even "concept" are also fuzzy concepts but despite that, we managed to substantially improve our understanding of the-things-in-the-world-they-point-to and the phenomena they interact with. Using these fuzzy concepts even made us realize how fuzzy they are, what's the nature of their fuzziness, and what other (more natural/appropriate/useful/reality-at-joint-carving) abstractions we may replace them with.^[1] In other words, we can use fuzzy concepts as a ladder/provisional scaffold for understa... (read more)

[-]Keenan Pepper2y30

Some of your YouTube links are broken because the equals sign got escaped as "%3D". If I were you I'd spend a minute to fix that.

[-]Joseph Bloom2y31

Strong disagree. Can’t say I’ve worked through the entire article in detail but wanted to chime in as one of the many junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin’s idea that we’re creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approa... (read more)

2cozyfractal2y

I'm not sure what you meant about studying transistors. It seems to me that if we are studying transistors so hard, it's to push computers' capabilities (faster, smaller, more energy efficient, etc.), and not at all to make software safer. Instead, to make software safer, we use anti-viruses, automatic testing, developer liability, standards, regulations, pop-up warnings, etc.

[-]Sheikh Abdur Raheem Ali3mo10

Consider updating "Auditing deception with Interp is out of reach" with a link to https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai

[-]Thanh Do6mo10

But I’ve never seen anybody use it in industry

Disagree - I used to work for a computer vision startup, and it was pretty common for us to visualize and manually change the hyperparams - yes, we manually changed them.

[-]ojorgensen2y10

One central criticism of this post is its pessimism towards enumerative safety. (i.e. finding all features in the model, or at least all important features). I would be interested to hear how the author / others have updated on the potential of enumerative safety in light of recent progress on dictionary learning, and finding features which appear to correspond to high-level concepts like truth, utility and sycophancy. It seems clear that there should be some positive update here, but I would be interested in understanding issues which these approaches wil... (read more)

[-]The Non-Economist2y10

Generally lots of value-add discussions, but there are some gaps I want to fill on potentially biased PoVs.

Starting with Value-Adds:

1) It's great to point out how interpretability (currently) doesn't solve real-life problems and the types of problems it won't solve.

2) Covering views that warn against the dangers of interpretability

3) Interpretability most of the time is unnecessary...

Filling in the gaps

1) There's a clear difference between pre-deployment and post-deployment interpretability. Post-deployment interpretability is dangerous. Pre-deploy... (read more)

[-]Hamish Doodles2y1-3

The proportion of junior researchers doing interp rather than other technical work is too high

I think that's because it's almost the only thing that junior researchers can productively work on.

Even if mech interp isn't in itself useful, I'd guess it's pretty useful as a source of endless puzzles to help people skill up in doing technical ML work.

3Charbel-Raphaël2y

I disagree. I have seen plenty of young researchers being unproductive doing interp. Writing code does not necessarily mean being productive. There are a dozen different streams in SERI MATS, and interp is only one of them. I don't quite understand how you can be so sure that interp is the only way to level up.

[-]Rudi C2y*1-18

This post has good arguments, but it mixes in a heavy dose of religious evangelism and narcissism which detracts from its value.

The post can be less controversial and “culty” if it drops its second-order effect speculations and its value judgements, and just presents a case that other technical areas of safety research are underrepresented. Focusing on non-technical work needs to be a whole other post, as it’s completely unrelated to interp.

[-]muggingblaise2y*10

Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token. If the neural network had deceptive thoughts, we could see them in these intermediate messages.

This doesn't account for the possibility that there's still steganography involved. Plain English coming from an LLM may not be so plain given

33. Alien Concepts: “The AI does not think like you do” There may not necessarily be a humanly understandable explanation for cognition done by crunchin

... (read more)