Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple of days ago.

Some excerpts I think are worth highlighting:

The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1].  But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.

One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying?  Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and lying to users in self-aware ways, I think we are well past the point where we can credibly claim not to have seen "real-world scenarios of deception".  I could try to imagine readings that make those sentences sound more reasonable, but I don't believe I'm omitting any context relevant to how readers should understand them.

There are other more exotic consequences of opacity, such as that it inhibits our ability to judge whether AI systems are (or may someday be) sentient and may be deserving of important rights.  This is a complex enough topic that I won’t get into it in detail, but I suspect it will be important in the future.

Interesting to note that Dario feels comfortable bringing up AI welfare concerns.

Recently, we did an experiment where we had a “red team” deliberately introduce an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various “blue teams” the task of figuring out what was wrong with it.  Multiple blue teams succeeded; of particular relevance here, some of them productively applied interpretability tools during the investigation.  We still need to scale these methods, but the exercise helped us gain some practical experience using interpretability techniques to find and address flaws in our models.

My understanding is that mech interp techniques didn't provide any obvious advantage over other techniques in that experiment, and that SAEs underperform much simpler and cheaper techniques, to the point where Google DeepMind (GDM) is explicitly deprioritizing them as a research direction.
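For readers who haven't followed the SAE debate, here is a minimal sketch of the comparison being made. This is my own illustration, not anything from the essay or the auditing experiment, and the baseline I show (a linear probe) is only my assumption about the kind of "simpler and cheaper" technique meant; all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs model activations through a wide, sparsity-penalized bottleneck."""
    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model (overcomplete basis)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))              # sparse candidate "features"
        recon = self.decoder(features)
        recon_loss = (recon - acts).pow(2).mean()               # reconstruction term
        sparsity_loss = self.l1_coeff * features.abs().mean()   # L1 penalty encouraging sparsity
        return features, recon_loss + sparsity_loss

# A "much simpler and cheaper" kind of baseline: a linear probe trained directly on
# the same activations to predict the property you care about
# (e.g. "is the model exploiting the loophole?").
d_model = 768                                   # placeholder hidden size
acts = torch.randn(32, d_model)                 # stand-in batch of activations

sae = SparseAutoencoder(d_model, d_hidden=8 * d_model)
features, sae_loss = sae(acts)                  # unsupervised: learn a feature dictionary, then inspect it

probe = nn.Linear(d_model, 1)
probe_logits = probe(acts)                      # supervised: one linear map, trained with labels for the behavior
```

The rough contrast: the SAE is trained without labels and only later inspected for relevant features, while the probe is trained directly on the question you actually want answered, which is part of why it tends to be far cheaper to run and evaluate.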

On one hand, recent progress—especially the results on circuits and on interpretability-based testing of models—has made me feel that we are on the verge of cracking interpretability in a big way.  Although the task ahead of us is Herculean, I can see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI—a true “MRI for AI”.  In fact, on its current trajectory I would bet strongly in favor of interpretability reaching this point within 5-10 years.

On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time.  As I’ve written elsewhere, we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027.  I am very concerned about deploying such systems without a better handle on interpretability.  These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.

The timelines seem aggressive to me, but I think it's good to have the last sentence spelled out.

If it helps, Anthropic will be trying to apply interpretability commercially to create a unique advantage, especially in industries where the ability to provide an explanation for decisions is at a premium.  If you are a competitor and you don’t want this to happen, you too should invest more in interpretability!

Some people predicted that interpretability would be dual-use for capabilities (and that this was a downside that had to be accounted for).  I think I was somewhat skeptical at the time, mostly because I didn't expect interpretability techniques to gain enough traction to move the needle much compared to more direct capabilities research; but to the extent that you trust Dario's research taste, this might be an update.


I don't want to sound too down on the essay; I think it's much better than the last one in a bunch of ways, not the least of which is that it directly acknowledges misalignment concerns (though it still shies away from mentioning x-risk).  There's also an interesting endorsement of a "conditional pause" (framed differently).

I recommend reading the essay in full.

  1. ^

    Copied footnote: "You can of course try to detect these risks by simply interacting with the models, and we do this in practice.  But because deceit is precisely the behavior we’re trying to find, external behavior is not reliable.  It’s a bit like trying to determine if someone is a terrorist by asking them if they are a terrorist—not necessarily useless, and you can learn things by how they answer and what they say, but very obviously unreliable."

  2. ^

    Copied footnote: "I’ll probably describe this in more detail in a future essay, but there are a lot of experiments (many of which were done by Anthropic) showing that models can lie or deceive under certain circumstances when their training is guided in a somewhat artificial way.  There is also evidence of real-world behavior that looks vaguely like “cheating on the test”, though it’s more degenerate than it is dangerous or harmful.  What there isn’t is evidence of dangerous behaviors emerging in a more naturalistic way, or of a general tendency or general intent to lie and deceive for the purposes of gaining power over the world.  It is the latter point where seeing inside the models could help a lot."

23 comments

I couldn’t get two sentences in without hitting propaganda, so I set it aside. But I’m sure it’s of great political relevance.

I'm a bit out of the loop; I used to think Anthropic was quite different from the other labs and quite in sync with the AI x-risk community.

Do you consider them relatively better? How would you quantify the current AI labs (Anthropic, OpenAI, Google DeepMind, DeepSeek, xAI, Meta AI)?

Suppose that the worst lab has a -100 influence on the future for each $1 it spends. A lab half as bad has a -50 influence on the future for each $1 it spends. A lab that's actually good (with half the magnitude) might have a +50 influence for each $1.

What numbers would you give to these labs?[1]

EDIT: I'm asking this question to anyone willing to answer!

  1. ^

    It's possible this rating is biased against smaller labs, since spending even a tiny bit increases "the number of labs" by 1, which is a somewhat fixed cost. To avoid this bias, maybe pretend each lab was scaled to the same size.

Ben Pace

I think that Anthropic is doing some neat alignment and control work, but it is also the company most effectively incentivizing people who care about existential risk to sell out, endorse propaganda, silence themselves, and get on board with the financial incentives of massive monetization and capabilities progress. In this way I see it as doing more damage than OpenAI (though OpenAI used to have this mantle pre-Anthropic, while the Amodei siblings were there, with Christiano as a researcher and Karnofsky on the board).

I don't really know the relative numbers, in my mind the uncertainty I have spans orders of magnitude. The numbers are all negative.

Just my personal opinion:

My sense is that Anthropic is somewhat more safety-focused than the other frontier AI companies, in that most of the companies only care maybe 10% as much about safety as they should, and Anthropic cares 15% as much as it should.

What numbers would you give to these labs?

My median guess is that if an average company is -100 per dollar then Anthropic is -75. I believe Anthropic is making things worse on net by pushing more competition, but an Anthropic-controlled ASI is a bit less likely to kill everyone than an ASI controlled by anyone else.

But I also have significant (< 50%) probability on Anthropic being the worst company in terms of actual consequences because its larger-but-still-insufficient focus on safety may create a false sense of security that ends up preventing good regulations from being implemented.

You may also be interested in SaferAI's risk management ratings.

I used to think Anthropic was [...] quite in sync with the AI x-risk community.

I think Anthropic leadership respects the x-risk community in their words but not in their actions. Anthropic says safety is important, and invests a decent amount into safety research; but also opposes coordination, supports arms races, and has no objection to taking unilateral actions that are unpopular in the x-risk community (and among the general public for that matter).

has no objection to taking unilateral actions that are unpopular in the x-risk community (and among the general public for that matter)

I have the courage of my convictions; you ignore the opinions of others; he takes reckless unilateral action.

The question under discussion was: Is Anthropic "quite in sync with the AI x-risk community"? If it's taking unilateral actions that are unpopular with the AI x-risk community, then it's not in sync.

There were multiple questions under discussion.
His reply could validly be said to apply to the subset of your post implying, or directly saying, that Anthropic is doing a bad thing, i.e. he is highlighting that disagreement is real and allowed.

You are correct that there is a separate vein here about the factual question of whether they are "in sync" with the AI x-risk community. That is a separate question that 1a3orn was not touching with their reply. You are mixing the two frames. If this was intentional, you were being disingenuous; if it was unintentional, you were being myopic.

Do you see any hope of convincing them that they're not a net positive influence and that they should shut down all their capabilities projects? Or is that simply not realistic human behaviour?

From the point of view of rational self-interest, I'm sure they care more about surviving the singularity and living a zillion years than about temporarily being a little richer for 3 years[1] until the world ends (I'm sure these people can live comfortably while waiting).

  1. ^

    I think Anthropic predicts AGI in 3 years, but I'm unsure about its timeline for ASI (superintelligence).

Not only would most people be hopelessly lost on these questions (“Should I give up millions-of-dollars-and-personal-glory and then still probably die just because it is morally right to do so?”), they have also picked up something that they cannot put down. These companies have thousands of people making millions of dollars, and they will reform in another shape if the current structure is broken apart. If we want to put down what has been picked up more stably, we must use other forces that do not wholly arise from within the companies.

I agree that it's psychologically very difficult, and that "is my work a net positive" is also hard to answer.

But I don't think it's necessarily about millions of dollars and personal glory. I think the biggest difficulty is the extreme social conflict and awkwardness of telling researchers who are personally very close to you to simply shut down the project they've poured hard work into, and to go do something else that probably won't make money, when in the end the company will probably go bankrupt anyway.

As for millions of dollars, the top executives have enough money they won't feel the difference.

As for "still probably die," well, from a rational self-interest point of view they should spend the last years they have left on vacation, rather than stressing out at a lab.

As for personal glory, it's complicated. I think they genuinely believe there is a very decent chance of survival, in which case "doing the hard unpleasant thing" will result in far more glory in the post-singularity world. I agree it may be a factor in the short term.


I think questions like "is my work a net positive?", "Is my ex-girlfriend more correct about our breakup than me?", and "Is the political party I like running the economy better?" are some of the most important questions in life. But all humans are delusional about these most important questions, and no matter how smart you are, wondering about them will simply give your delusions more time to find reassurances that you aren't delusional.

The only way out is to look at how other smart rational people are delusional, and how futile their attempts at self questioning are, and infer that holy shit this could be happening to me too without me realizing it.

Not sure I get your overall position. But I don’t believe all humans are delusional about the most important questions in their lives. See here for an analysis of pressures on people that can cause them to be insane on a topic. I think you can create inverse pressures in yourself, and you can also have no pressures and simply use curiosity and truth-seeking heuristics. It’s not magic to not be delusional. It just requires doing the same sorts of cognition you use to fix a kitchen sink.

Admittedly, I got a bit lost writing the comment. What I should've written was: "not being delusional is either easy or hard."

  • If it's easy, you should be able to convince them to stop being delusional, since it's their rational self interest.
  • If it's hard, you should be able to show them how hard and extremely insidious it is, and how one cannot expect oneself to succeed, so one should be far more uncertain/concerned about delusion.

I think there is some hope but I don't really know how to do it. I think if their behavior was considered sufficiently shameful according to their ingroup then they would stop. But their ingroup specifically selects for people who think they are doing the right thing.

I have some small hope that they can be convinced by good arguments, although if that were true, surely they would've already been convinced by now? Perhaps they are simply not aware of the arguments for why what they're doing is bad?

Quick take: it's focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment is clearly not scalable to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, but pretends that it is a safety thing, as if the two are not at odds with each other when engaged in racing dynamics.)

prosaic alignment is clearly not scalable to the types of systems they are actively planning to build

Why do you believe this?

(FWIW I think it's foolish that all (?) frontier companies are all-in on prosaic alignment, but I am not convinced that it "clearly" won't work.)

Because they are all planning to build agents subject to optimization pressure, and RL-type failures apply when you build RL systems, even if they're built on top of LLMs.

I expected your comment to be hyperbolic, but no. I mean sheesh:

In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world.  In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so.  We can’t stop the bus, but we can steer it.  In the past I’ve written about the importance of deploying AI in a way that is positive for the world and of ensuring that democracies build and wield the technology before autocracies do. 

(Emphasis mine.) What rhetorical cleverness. This translates as: "I have expertise and foresightedness; here's your Overton window." Then he goes gears-level (ish) for a whole essay, reinscribing in the minds of Serious People the lethal assumptions laid out here: "We can't slow down; if you knew what I knew you'd see the 'forces' that make this obvious, and besides do you want the commies to win?"

I'm not just doing polemic. I think the rhetorical strategy "dismissing pause and cooperation out of hand instead of arguing against them" tells us something. I'm not sure what, alas. I do think that labs' arguments to the governments work best if they've already set the terms of the debate. It helps Dario's efforts if "pause/cooperate" is something "all the serious people know" is not worth paying attention to. 

I 80% think he also believes that pausing and cooperation are bad ideas (despite his obvious cognizance of the time-crunch). But I doubt he dismisses them so out-of-hand privately.

Ironically, arguably the most important/useful point of the essay is that it argues for a rebranded version of the "precisely timed short slowdown/pause/pivot resources to safety" proposal. Dario has rebranded it as spending down a "security buffer".

(I don't have a strong view on whether this is a good rebrand; it seems reasonable to me, I guess, and the terminology seems roughly as good for communicating about this type of action.)

(Genuine question) Are you suggesting that he thinks we can stop, but is lying about that?

Just because he might not spell out that particular node of his argument here doesn't mean it is propaganda. Locally, by itself, I don't see how this could be propaganda, but maybe extra context I'm missing makes that clear.

I'm saying that he is presenting it, without argument or evidence, as something he believes from his place of expertise and private knowledge, and that he does so because it is exceedingly morally and financially beneficial to him (he gets to make massive money and not be a moral monster), rather than because he has any evidence.

It is similar to a President of a country who has just initiated a war saying "If there's one thing I've learned in my life, it's that war is inevitable, and there's just a question of who wins and how to make sure it's over quickly", in a way that implies they should be absolved of responsibility for initiating the war.

Edit: Just as Casey B was writing his reply below, I edited out an example of Mark Zuckerberg saying something like "If there's one thing I've learned in my career, it's that social media is good, and the only choice is which sort of good social media we have". Leaving this note so that people aren't confused by his reply.

Casey_

I truly have no settled opinion on Anthropic or on you, and I get that it's annoying to zoom in on something minor/unimportant like this, but:

I think your lack of charity here is itself a kind of propaganda, a non-truth-seeking behavior. Even in your Mark Zuckerberg hypothetical, considered purely "technically"/locally, if he's not lying in that scenario, I don't think it's appropriate to call it propaganda. But of course in that world, as in this one, he is overwhelmingly likely to be lying, or mind-warped past the point of the difference mattering.

Are you not equivocating/motte-and-bailey-ing between what seem to be three possible meanings of "propaganda"?
1) something like lies
2) doing something that's also financially beneficial to you
3) stating something without evidence

You know the connotation "propaganda" conveys (1, and general badness), but you fell back on 2 and 3.

Also, while you may not agree with them, you know there are plenty of arguments (proposed evidence) for why we can't stop; you are therefore being disingenuous. Must they be penned in toto by Dario to count? And didn't he put serious effort behind Machines of Loving Grace? (I haven't read it yet.) This isn't an 'out of the blue' and/or 'obvious bullshit' position like the Zuck hypothetical; the whole AI world is debating/split on this issue; it is reasonably possible he really believes it, etc.

... 

Edit: I saw your comment change just as I posted my reply. Not your fault, to be clear; just explaining why some of the content of my reply refers to things that no longer exist.

I don't think that propaganda must necessarily involve lying. By "propaganda," I mean aggressively spreading information or communication because it is politically convenient / useful for you, regardless of its truth (though propaganda is sometimes untrue, of course).

When a government puts up posters saying "Your country needs YOU" this is intended to evoke a sense of duty and a sense of glory to be had; sometimes this sense of duty is appropriate, but sometimes your country wants you to participate in terrible wars for bad reasons. The government is saying it loudly because for them it's convenient for you to think that way, and that’s not particularly correlated with the war being righteous or with the people who decided to make such posters even having thought much about that question. They’re saying it to win a war, not to inform their populace, and that’s why it’s propaganda.

Returning to the Amodei blogpost: I’ll happily concede that you don’t always need to give reasons for your beliefs when expressing them—context matters. But in every context—tweets, podcasts, ads, or official blogposts—there’s a difference between sharing something to inform and sharing it to push a party line.

I claim that many people have asked why Anthropic believes it's ethical for them to speed up AI progress (by contributing to the competitive race), and Anthropic has rarely if ever given a justification. Senior staff keep indicating that not building AGI is not on the table, yet they rarely if ever show up to engage with criticism or to give justifications for this in public discourse. This is a key reason why it reads to me as propaganda: it's an incredibly convenient belief for them, and they state it as though any other position is untenable, without argument and without acknowledging or engaging with the position that it is ethically wrong to speed up the development of a technology they believe has a 10-20% chance of causing human extinction (or a similarly bad outcome).

I wish that they would just come out, lay out the considerations for and against building a frontier lab that is competing to reach the finish line first, acknowledge other perspectives and counterarguments, and explain why they made the decision they have made. This would do wonders for the ability to trust them.

(Relatedly, I don't believe the Machines of Loving Grace essay is defending the position that speeding up AI is good; the piece in fact explicitly says it will not assess the risks of AI. Here are my comments at the time on that essay also being propaganda.)

Haha: agreed :)
My rage there was about a different level/kind of fakery from Anthropic, but I see now how this could connect with, or be part of, a broader pattern that I wasn't aware of. Remaining quibbles aside, I was wrong; this would be sufficient context/justification for using "propaganda".

I see how it could potentially be the same pattern as people claiming EY hasn't talked enough about his position; to the person you disagree with, you will never have explained enough. But yeah, I doubt any of the labs have.
