TL;DR: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it.

I still want everyone working on interpretability to read it and engage with its arguments.

Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.

Updates on my views

Legend:

  • On the left of the arrow, a quote from the OP; on the right of the →, my review, which generally begins with one of the emojis below
  • ✅ - yes, I think I was correct (>90%)
  • ❓✅ - I would lean towards yes (70%-90%)
  • ❓ - unsure (between 30%-70%)
  • ❓❌ - I would lean towards no (10%-30%)
  • ❌ - no, I think I was basically wrong (<10%)
  • ⭐ - important; you can skip the other sections

Here's my review section by section:

⭐ The Overall Theory of Impact is Quite Poor?

  • "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; But as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task.
    • "Interpretability is Not a Good Predictor of Future Systems" → ✅ This claim holds up pretty well. Interpretability still hasn't succeeded in reliably predicting future systems, to my knowledge.
    • "Auditing Deception with Interpretability is Out of Reach" → ❓ The "Out of Reach" is now a little bit too strong, but the general direction was pretty good. The first major audits of deception capabilities didn't come from interpretability work; breakthrough papers came from AI Deception: A Survey of Examples, Risks, and Potential Solutions, Apollo Research's small demo using bare prompt engineering, and Anthropic's behavioral analyses. This is particularly significant because detecting deception was a primary motivation for many people working on interpretability at the time. I don't think being able to identify the sycophancy feature qualifies as being able to audit deception: maybe the feature is just here to recognize sycophancy without using it, as explained in the post. (I think the claim should now be "Auditing Deception without Interpretability is currently much simpler").
  • "Interpretability often attempts to address too many objectives simultaneously" → ❓ I don't think this is as important nowadays, but I tend to still think that goal factoring is still a really important cognitive and neglected move in AI Safety. I can see how interp could help a bit for multiple goals simultaneously, but also if you want to achieve more coordination, just work on coordination.
  • "Interpretability could be harmful - Using successful interpretability for safety could certainly prove useful for capabilities" → ❓❌ I think I was probably wrong, more discussion below, in section “Interpretability May Be Overall Harmful”.

What Does the End Story Look Like?

  • "Enumerative Safety & Olah interpretability dream":
    • ⭐ Feasibility of feature enumeration → ❓ I was maybe wrong, but this is really tricky to assess.
      • On the plus side, I was genuinely surprised to see SAEs working that well because the general idea already existed, some friends had tried it, and it didn't seem to work at the time. I guess compute also plays a crucial role in interpretability work. I was too confident. Progress is possible, and enumerative safety could represent an endgame for interpretability.
      • On the other hand, many problems remain, and I think we need to be very cautious in evaluating this type of research:
        • SAEs only reconstruct a much smaller model: recovering only 65% of the variance means that the reconstructed model would perform very poorly. Some features are very messy, and a lot of what models know how to do is simply not represented in the SAE (see the sketch after this list for what this fraction-of-variance-explained metric looks like concretely).
        • The whole paradigm is probably only a computationally convenient approximation: I think that the strong feature hypothesis is probably false and is not going to be sufficient to reconstruct the whole model. Some features are probably stored across multiple layers, some features might be instantiated only in a dynamic way, and I'm skeptical that we can reduce the model to just a static weighted directed graph of features. Another point is that language models are better than humans at next-token prediction, and I expect some features to be beyond human knowledge and understanding.
        • SAEs have not been applied to the most computationally intensive models (they were trained on Sonnet, not Opus), which are the ones of most interest, because SAEs cost a lot of compute.
        • We cannot really use SAEs for enumerative safety because we wouldn't be able to exclude emergent behavior. As a very concrete example, if you train SAEs on a sleeper agent (on the training distribution that does not trigger the backdoor), you will not surface any deception feature (which might be a bit unfair because the sleeper agent's training data does contain deceptive content, but this would maybe be more analogous to a natural emergence). Maybe someone should try to detect backdoors with SAEs. (Thanks to Fabien for raising these points to me!)
      • At the end of the day, it's very unclear how to make enumerative safety arguments with SAEs.
    • Safety applications? → ❓✅ Some parts of my critique of enumerative safety remain valid. The dual-use nature of many features remains a fundamental challenge: Even after labeling all features, it's unclear how we can effectively use SAEs, and I still think that “Determining the dangerousness of a feature is a mis-specified problem”: “there's a difference between knowing about lies, being capable of lying, and actually lying in the real world”. At the end of the day, Anthropic didn’t use SAEs to remove harmful behaviors from Sonnet that were present in the training data, and it’s still unclear if SAEs beat baselines (for a more detailed analysis of the missing safety properties of SAEs, read this article).
    • Harmfulness of automated research? → ❓ I think the automation of the discovery of Claude's features was not that dangerous and is a good example of automated research. Overall, I'm a bit more sympathetic to this kind of automated AI safety research today than I was when writing the post.[1]
  • Reverse Engineering? → ✅ Not much progress here. It seems like IOI remains roughly the most interesting circuit we've found in any language model, and current work and techniques, such as edge pruning, remain focused on toy models.
  • Retargeting the search? → ❓ I think I was overconfident in saying that being able to control the AI via the latent space is just a new form of prompt engineering or fine-tuning. I think representation engineering could be more useful than this, and might enable better control mechanisms.
  • Relaxed adversarial training? → ❓✅ I made a call by saying this could be one of the few ways to reduce AI bad behavior even under adversarial pressure, and it seems like this is a promising direction today.
  • Microscope AI? → ❓✅ I think what I said in the past about the uselessness of microscope AI remains broadly valid, but there is an amendment to be made: "About a year ago, Schut et al. (2023) did what I think was (and maybe still is) the most impressive interpretability research to date. They studied AlphaZero's chess play and showed how novel performance-relevant concepts could be discerned from mechanistic analysis. They worked with skilled chess players and found that they could help these players learn new concepts that were genuinely useful for chess. This appears to be a reasonably unique way of doing something useful (improving experts' chess play) that may have been hard to achieve in some other way." - Summary from Casper.
    • Very cool paper, but I think this type of work is more like a very detailed behavioral analysis guided by some analysis of the latent space, and I expect this kind of elicitation work for narrow AI to be deprecated by future general-purpose AI systems, which will be able to teach us those concepts directly, and which we will be able to fine-tune directly to do this. Think of a super Claude-teacher.
    • Also, AlphaZero is an agent - it's not a pure microscope - so this is a very different vision from the one Olah lays out in his explanation of microscope AI here.
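
As a side note on the "65% of the variance" point above, here is a minimal sketch of what that fraction-of-variance-explained measurement operationalizes: a vanilla sparse autoencoder trained on cached residual-stream activations, then scored on how much of the activation variance its reconstruction recovers. This is not Anthropic's actual training setup; the architecture, hyperparameters, and the `activations.pt` file are hypothetical placeholders.

```python
# Minimal sketch, not Anthropic's setup: a vanilla SAE on cached activations,
# plus the fraction-of-variance-explained (FVE) metric discussed above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(feats), feats

def train_sae(acts, d_features, l1_coeff=1e-3, steps=10_000, lr=1e-4, batch_size=4096):
    sae = SparseAutoencoder(acts.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, len(acts), (batch_size,))]
        recon, feats = sae(batch)
        # reconstruction loss + L1 sparsity penalty on the feature activations
        loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

def fraction_of_variance_explained(sae, acts):
    with torch.no_grad():
        recon, _ = sae(acts)
        residual = (acts - recon).pow(2).sum()
        total = (acts - acts.mean(dim=0)).pow(2).sum()
    return float(1 - residual / total)

# Hypothetical usage:
# acts = torch.load("activations.pt")               # cached residual-stream activations
# sae = train_sae(acts, d_features=16 * acts.shape[-1])
# print(fraction_of_variance_explained(sae, acts))  # e.g. ~0.65 in the result cited above
```

The point of the sketch is only to make the critique concrete: whatever the decoder fails to reconstruct is model behavior that the feature dictionary simply does not capture, which is why a 65% figure is worrying for enumerative safety arguments.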

⭐ So Far My Best Theory of Impact for Interpretability: Outreach?

❓✅ I still think this is the case, but I have some doubts. I can share numerous personal anecdotes where even relatively unpolished introductions to interpretability during my courses generated more engagement than carefully crafted sessions on risks and solutions. Concretely, I shamefully capitalize on this by scheduling interpretability week early in my seminar to nerd-snipe students' attention.

But I now see two competing potential theories of impact:

  • Better control mechanisms: For example, something that I was not seeing clearly in the past was the possibility of having better control over these models.
    • I think the big takeaway is that representation engineering might work: I find the work Simple probes can catch sleeper agents (Anthropic) very interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place); see the probe sketch after this list. I was very surprised by those results. I think products such as Goodfire steering Llama3 are interesting, and I'm curious to see future developments. Circuit breakers also seem exciting in this regard.
    • This might still be a toy example, but I've found this work from Sieve interesting: SAEs Beat Baselines on a Real-World Task. They claim to be able to steer the model better than with other techniques, on a non-trivial task: "Prompt engineering can mitigate this in short context inputs. However, Benchify frequently has prompts with greater than 10,000 tokens, and even frontier LLMs like Claude 3.5 will ignore instructions at these long context lengths." "Unlike system prompts, fine-tuning, or steering vectors which affect all outputs, our method is very precise (>99.9%), meaning almost no side effects on unrelated prompts."
    • I'm more sympathetic to exploratory work like gradient routing, which may offer affordances in the future that we don't know about now.
  • Deconfusion and better understanding: But I should have been more charitable to the second-order effects of better understanding of the models. Understanding how models work, providing mechanistic explanations, and contributing to scientific understanding all have genuine value that I was dismissing.[2]
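
As a concrete illustration of the "simple probes" result mentioned in the first bullet above, here is a minimal sketch of what such a probe looks like mechanically: a linear classifier fit on pooled residual-stream activations. This is not Anthropic's actual setup; the `get_activations` helper, the layer index, and the labels are hypothetical placeholders.

```python
# Minimal sketch of a linear "defection" probe on model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_defection_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear probe on pooled activations of shape (n_prompts, d_model)."""
    X_train, X_test, y_train, y_test = train_test_split(
        acts, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", probe.score(X_test, y_test))
    return probe

# Hypothetical usage:
# acts = get_activations(model, prompts, layer=20)  # (n_prompts, d_model), extraction helper assumed
# labels = np.array([...])                          # 1 = prompt where the sleeper agent defects, 0 = benign
# probe = train_defection_probe(acts, labels)
```

The mechanics are deliberately trivial; the surprising part, as noted above, is how well such simple probes seem to generalize, and whether that holds for models that were never trained to be harmful in the first place.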

⭐ Preventive Measures Against Deception

I still like the two recommendations I made:

  1. Steering the world towards transparency → ✅ This remains a good recommendation. For instance, today, we can choose not to use architectures that operate in latent spaces, favoring architectures that reason with tokens instead (even if this is not perfect either). Meta's proposal for new transformers using latent spaces should be concerning, as these architectural choices significantly impact our control capabilities.
    1. "I don't think neural networks will be able to take over in a single forward pass. Models will probably reason in English and will have translucent thoughts" → ❓✅ This seems to be the case?
    2. And many of the works I was suggesting are now done and have been informative for the control agenda → ✅
  2. Cognitive emulation (using the most powerful scaffolding with the least powerful model capable of the task) → ✅ This remains a good safety recommendation, I think, as we don't want elicitation to be done only in the future; we want to extract all the juice there is from current LLMs now. Christiano elaborates a bit more on this, weighing it against other negative externalities, such as a faster race: Thoughts on Sharing Information About Language Model Capabilities, section "Accelerating LM agents seems neutral (or maybe positive)".

Interpretability May Be Overall Harmful

False sense of control → ❓✅ generally yes.

The world is not coordinated enough for public interpretability research → ❌ generally no:

  • Dual use & When interpretability starts to be useful, you can't even publish it because it's too info-hazardous → ❌ - It's pretty likely that if, for example, SAEs start to be useful, this won't boost capabilities that much.
  • Capability externalities & Interpretability already helps capabilities → ❓ - Mixed feelings:
    • This post shows a new architecture built using an interpretability discovery, but I don't think this will really stand out against the Bitter Lesson, so for the moment it seems like interpretability is not really useful for capabilities. Also, it seems easier to delete capabilities with interpretability than to add them. Interpretability hasn't significantly boosted capabilities yet.
    • But at the same time, I wouldn't be that surprised if interpretability could unlock a completely new paradigm that would be much more data efficient than the current one.

Outside View: The Proportion of Junior Researchers Doing Interpretability Rather Than Other Technical Work is Too High

  • I would rather see a more diverse ecosystem → ✅ - I still stand by this, and I'm very happy that ARENA, MATS, and ML4Good have diversified their curricula.
  • ⭐ “I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing” → ✅ Compare them doing interpretability vs. publishing their responsible scaling policies and evaluating their systems. I think RSPs did much, much more to reduce AI risks.

Even if We Completely Solve Interpretability, We Are Still in Danger

  • There are many X-risk scenarios, not even involving deceptive AIs → ✅ I'm still pretty happy with this enumeration of risks, and I think more people should think about this and directly think about ways to prevent those scenarios. I don't think interpretability is going to be the number one recommendation after this small exercise.
  • Interpretability implicitly assumes that the AI model does not optimize in a way that is adversarial to the user → ❓❌ - The image with Voldemort was unnecessary and might be incorrect for human-level intelligence. But I have the feeling that all of those brittle interpretability techniques won't stand for long against a superintelligence; I may be wrong.
  • ⭐ That is why focusing on coordination is crucial! There is a level of coordination above which we don't die - there is no such threshold for interpretability → ✅ I still stand by this: Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety) — LessWrong  

Technical Agendas with Better ToI

I'm very happy with all of my past recommendations. Most of those lines of research are now much more advanced than when I was writing the post, and I think they advanced safety more than interpretability did:

  • Technical works used for AI Governance
    • ⭐ "For example, each of the measures proposed in the paper towards best practices in AGI safety and governance: A survey of expert opinion could be a pretext for creating a specialized organization to address these issues, such as auditing, licensing, and monitoring" → ✅ For example, Apollo is mostly famous for their non-interpretability works.
    • Scary demos → ✅ Yes! Scary demos of deception and other dangerous capabilities were tremendously valuable during the last year, so continuing to do that is still the way to go
      • "(But this shouldn't involve gain-of-function research. There are already many powerful AIs available. Most of the work involves video editing, finding good stories, distribution channels, and creating good memes. Do not make AIs more dangerous just to accomplish this.)" → ❓ The point about gain-of-function research was probably wrong because I think Model organism is a useful agenda, and because it's better if this is done in a controlled environment than later. But we should be cautious with this, and at some point, a model able to do full ARA and R&D could just self-exfiltrate, and this would be irreversible, so maybe the gain-of-function research being okay part is only valid for 1-2 years.
    • "In the same vein, Monitoring for deceptive alignment is probably good because 'AI coordination needs clear wins.'" → ❓ Yes for monitoring, no for that being a clear win because of the reason explained in the post from Buck, saying that it will be too messy for policymakers and everyone to decide just based on those few examples of deception.
    • Interoperability in AI policy and good definitions usable by policymakers → ✅ - I still think that good definitions of AGI, self-replicating AI, good operationalization of red lines would be tremendously valuable for both RSPs levels, Code of Practices of the EU AI Act, and other regulations.
    • "Creating benchmarks for dangerous capabilities" → ✅ - I guess the eval field is a pretty important field now. Such benchmarks didn't really exist beforehand.
  • "Characterizing the technical difficulties of alignment”:
    • Creating the IPCC of AI Risks → ✅ - The International Scientific Report on the Safety of Advanced AI: Interim Report is a good baseline and was very useful to create more consensus!
    • More red-teaming of agendas → ❓ this has not been done but should be! I would really like it if someone was able to write the “Compendium of problems with AI Evaluation” for example. Edit: This has been done.
    • Explaining problems in alignment → ✅ - I still think this is useful
  • “Adversarial examples, adversarial training, latent adversarial training (the only end-story I'm kind of excited about). For example, the papers "Red-Teaming the Stable Diffusion Safety Filter" or "Universal and Transferable Adversarial Attacks on Aligned Language Models" are good (and pretty simple!) examples of adversarial robustness works which contribute to safety culture” → ❓ I think there are more direct ways to contribute to safety culture. Liron Shapira's podcast is better for that, I think.
  • "Technical outreach. AI Explained and Rob Miles have plausibly reduced risks more than all interpretability research combined": ❓ I think I need numbers to conclude formally, even if my intuition still says that the biggest bottleneck is a consensus on AI risks, not research. I have more doubts about AI Explained now, since he is pushing for safety only in a very subtle way, but who knows, maybe that's the best approach.
  • “In essence, ask yourself: "What would Dan Hendrycks do?" - Technical newsletters, non-technical newsletters, benchmarks, policy recommendations, risk analyses, banger statements, courses and technical outreach → ✅ and now I would add SB 1047, which I think was the best attempt of 2024 at reducing risks.
  • “In short, my agenda is "Slow Capabilities through a safety culture", which I believe is robustly beneficial, even though it may be difficult. I want to help humanity understand that we are not yet ready to align AIs. Let's wait a couple of decades, then reconsider.” → ✅ I think this is still basically valid, and I co-founded a whole organization trying to achieve more of this. I'm very confident what I'm doing is much better in terms of AI risk reduction than what I did previously, and I'm proud to have pivoted: 🇫🇷 Announcing CeSIA: The French Center for AI Safety.
  1. ^

    But I still don’t feel good about having a completely automated and agentic AI that would just make progress in AI alignment (aka OpenAI's old plan), and I don’t feel good about the whole race we are in.

  2. ^

    For example, this conceptual understanding enabled via interpretability was useful for me in dissolving the hard problem of consciousness.

I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (section Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high), and I do think that this generalizes somewhat to the whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:

  • Empirically, we've already kind of tested this, and it doesn't work.
    • I don't think that what Scott Aaronson produced while at OpenAI has really helped AI safety: he did exactly what is criticized in the post, streetlight research using techniques he was already familiar with from his previous field of research, and I don't think the author of the OP would disagree with me. Maybe n=1, but it was one of the most promising shots.
    • Two years ago, I was doing field-building and trying to source talent, primarily selecting based on pure intellect and raw IQ. I organized the Von Neumann Symposium around the problem of corrigibility, targeting IMO laureates and individuals from the best school in France, ENS Ulm, which arguably has the highest concentration of future Nobel laureates in the world. However, pure intelligence doesn't work. In the long term, the individuals who succeeded in the field weren't the valedictorians from France's top school, but rather those who were motivated, had read The Sequences, were EA people, possessed good epistemology, and were willing to share their work online (maybe you are going to say that the people I was targeting were too young, but I think my little empirical experience is already much better than the speculation in the OP).
    • My prediction is that if you put a group of skilled physicists in a room, first, it's not even clear that you would find many motivated people in this reference class, and I don't think the few who would be motivated would produce good-quality work.
    • For the ML4Good bootcamps, the scoring system reflects this insight. We use multiple indicators and don't rely solely on pure IQ to select participants, because there is little correlation between pure high IQ and long-term quality production.
  • I believe the biggest mistake in the field is trying to solve "Alignment" rather than focusing on reducing catastrophic AI risks. Alignment is a confused paradigm; it's a conflationary alliance term that has sedimented over the years. It's often unclear what people mean when they talk about it: Safety isn't safety without a social model.
    • Think about what has been most productive in reducing AI risks so far. My short list would be:
      • The proposed SB 1047 legislation.
      • The short statement on AI risks
      • Frontier AI Safety Commitments, AI Seoul Summit 2024, to encourage labs to publish their responsible scaling policies.
      • Scary demonstrations to showcase toy models of deception, fake alignment, etc., and to create more scientific consensus, which is very much needed.
    • As a result, the field of "Risk Management" is more fundamental for reducing AI risks than "AI Alignment." In my view, the theoretical parts of the alignment field have contributed far less to reducing existential risks than the responsible scaling policies or the draft of the EU AI Act's Code of Practice for General Purpose AI Systems, which is currently not far from being the state of the art for AI risk management. Obviously, it's still incomplete, but I think that's the most productive direction today.
  • Relatedly, the Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety that we could imitate for AI safety (see this draft). However, the agent foundations community seems to be completely neglecting this.

Agreed, this could be much more convincing. We still have a few shots, but I still think nobody will care even with a much stronger version of this particular warning shot.

Coming back to this comment: we got a few clear examples, and nobody seems to care:

"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.

This time we caught it. Next time, maybe we won't be able to catch it.

Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.

btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically "Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?" makes me (I think) significantly more strategic and productive. Similar to I hired 5 people to sit behind me and make me productive for a month. But this is maybe off topic.

I was saying 2x because I've memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, this was from one year ago, so perhaps the two factors cancel each other out?

[Image from the linked study: summary of the experiment process and results]

How much faster do you think we are already? I would say 2x.

What do you not fully endorse anymore?

I would be happy to discuss in a dialogue about this. This seems to be an important topic, and I'm really unsure about many parameters here.
