I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (see the section "Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high"), and I do think this generalizes somewhat to the whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:
I think, in the policy world, perplexity will never be fashionable.
Training compute maybe, but if so, how do we ban Llama 3? It's already too late for that.
If so, the only policy I see is red lines at full ARA.
And we need to pray that this is sufficient, and that the buffer between ARA and takeover is large enough. I think it is.
Indeed. We are in trouble, and there is no plan as of today. We are soon going to blow past autonomous replication, and then adaptation and R&D. There are almost no remaining clear red lines.
In light of this, we appeal to the AI research and policy communities to quickly increase research into and funding around this difficult topic.
Hmm, unsure; honestly, I don't think we need much more research on this. What kind of research are you proposing? The only sensible policy I see for open-source AI is that we should avoid models that are able to do AI R&D in the wild, and a clear Schelling point for this is stopping before full ARA. But we definitely need more advocacy.
I found this story beautifully written.
I'm questioning the plausibility of this trajectory. My intuition is that merged humans might not be competitive or flexible enough in the long run.
For a line of humans to successfully evolve all the way to hiveminds as described, AI development would need to be significantly slowed or constrained. My default expectation is that artificial intelligence would bootstrap its own civilization and technology independently, leaving humanity behind rather than bringing us along on this gradual transformation.
I'm the founder of CeSIA, the French Center for AI Safety.
We collaborated with, advised, or gave interviews to 9 French YouTubers, with one video reaching more than 3.5 million views in a month. Given that about half of the French population watches YouTube, this video reached almost 10% of French YouTube users, which might be more than any AI safety video in any other language.
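As a rough back-of-the-envelope check (assuming France's population is about 68 million, a figure not stated in the original comment):

$$\frac{3.5 \text{ million views}}{0.5 \times 68 \text{ million}} \approx 10\%$$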
We think this is a very cost-effective strategy, and we encourage other organisations and experts in other countries to do the same.
or obvious-to-us ways to turn chatbots into agents, are very much not obvious to them
I think that's also surprisingly not obvious to many policymakers, and to many people in the industry. I've given introductory presentations on AI risks at various institutions, and the audiences were not familiar with the idea of scaffolding at all.
Coming back to this comment: we now have a few clear examples, and nobody seems to care:
"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.
This time we caught it. Next time, maybe we won't be able to.
TL;DR: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated towards the interpretability agenda being more feasible and making more progress than I expected, mainly because of SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.
First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it.
I still want everyone working on interpretability to read it and engage with its arguments.
Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.
Updates on my views
Legend:
Here's my review section by section:
⭐ The Overall Theory of Impact is Quite Poor?
What Does the End Story Look Like?
⭐ So Far My Best Theory of Impact for Interpretability: Outreach?
❓✅ I still think this is the case, but I have some doubts. I can share numerous personal anecdotes where even relatively unpolished introductions to interpretability during my courses generated more engagement than carefully crafted sessions on risks and solutions. Concretely, I shamefully capitalize on this by scheduling interpretability week early in my seminar to nerd-snipe students' attention.
But I see now two competing potential theories of impact:
⭐ Preventive Measures Against Deception
I still like the two recommendations I made:
Interpretability May Be Overall Harmful
False sense of control → ❓✅ generally yes:
The world is not coordinated enough for public interpretability research → ❌ generally no:
Outside View: The Proportion of Junior Researchers Doing Interpretability Rather Than Other Technical Work is Too High
Even if We Completely Solve Interpretability, We Are Still in Danger
Technical Agendas with Better ToI
I'm very happy with all of my past recommendations. Most of those lines of research are now much more advanced than when I wrote the post, and I think they have advanced safety more than interpretability has:
But I still don't feel good about having a completely automated, agentic AI that just makes progress on AI alignment (i.e., OpenAI's old plan), and I don't feel good about the whole race we are in.
For example, the conceptual understanding enabled by interpretability helped me dissolve the hard problem of consciousness.