FWIW here's the feedback that I provided:
(basically steel-manning the case for being skeptical; I haven't followed the Mythos/Glasswing discussion closely, so this is mostly just first principles)
I'm guessing that most skeptics are operating on the quite reasonable prior that companies generally don't economically disadvantage themselves (e.g. by willingly declining to release a product that would, apparently, earn them more money)
When a person has this prior, they have two scenarios to consider:
1. Anthropic is a significant outlier among tech companies and is indeed choosing - for purely ethical reasons - to miss out on revenue gained from releasing a better product during a heated race with >$100B at stake
2. Anthropic is a normal company and is taking this action for commercial reasons. While there might be some plausible ethical justification for the decision to not release this product, their comms / PR / partnership strategy is heavily informed by commercial motivations and is having an undesirable effect (hysteria / hype)
To paint a different scenario:
If Anthropic had put out very little communication themselves, and the same information had been announced by UK AISI and other partners, you would imagine this criticism would have been largely defused
I think skeptics implicitly know that Anthropic could have done things much more quietly.
To substantiate this intuition, note that Anthropic could easily have:
1) let the UK AISI do their job by releasing their cyber evaluation report like they do for any other model
2) discreetly formed testing & access partnerships, without doing a PR campaign
Link to UK AISI report:
https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities
Additionally, Anthropic probably could have foreseen this criticism, and yet they chose to take this approach anyway.
A skeptic needs to have an explanation for why Anthropic decided to put together this very glossy website that looks like a product launch: https://www.anthropic.com/glasswing
To anyone who has a more conservative prior on Anthropic's ethical motivations, the Occam's razor conclusion is that this is a marketing stunt to gain some commercial advantage
The argument that EA / AI safety people commonly make is mostly predicated on Anthropic being an outlier company, which itself requires significantly more evidence than the base case that they are a normal company
Many of us within the space feel like we've seen enough evidence to suggest that Anthropic might be an outlier company. Actions like the DoW incident seem to support this.
Over the past decade or so, "Most people's world models contain ~0 gears" has become an increasingly load-bearing premise in my understanding of humanity. Most people, including the ones publicly writing articles and posts on AI, have so little understanding of the subject matter that how they react to new information is nearly uncorrelated with what the information means to someone who does understand it. And unfortunately, in AI, the reality is often such that most people's reaction to a more-correct interpretation is going to be disbelief unless they have a lot of trust in the judgment of the person expressing it or are willing to spend a lot of time learning about the topic.
I agree with @Justin Olive that "Mythos claims are fake/exaggerated/hype/marketing" is a reasonable default on purely social priors if you know nothing about AI or cybersecurity, and nothing about Anthropic in particular, and simply ignore all the technical information.
FWIW, in my experience the most effective counter, at least one-on-one, has been to explain to/remind people that Anthropic has been growing almost 50%/month lately, from $9B in December to $30B in April, and that it is supply-constrained on a software product, which as far as I know has never happened before.
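For what it's worth, the implied compound monthly growth rate is easy to check (a quick sketch; the December-to-April window is ambiguous, so both a 3-month and a 4-month reading are shown):

```python
# Implied compound monthly growth rate from $9B (December) to $30B (April).
start, end = 9.0, 30.0

for months in (3, 4):  # "December to April" could span 3 or 4 compounding periods
    rate = (end / start) ** (1 / months) - 1
    print(f"{months} months: {rate:.1%}/month")
# 3 months: 49.4%/month
# 4 months: 35.1%/month
```

So "almost 50%/month" holds on the 3-month reading, and is still extraordinary on the 4-month one.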
If Anthropic exaggerated Mythos's cyber capabilities, that's on them, and a mistake by both profit and safety lights. If they didn't, it's hard to see how they could have acted much better without spending a lot of time for little return, or allowing leaks to spin rumors out of control, possibly resulting in the same accusations of clever leak-based marketing hype in the absence of a statement.
I do think it might've been smarter to lean a little harder on the "seems maybe pretty dangerous so just in case" narrative.
In either case, time will tell - although there's going to be a Y2K effect where using it to patch vulnerabilities reduces evidence that there was ever a danger.
Hopefully that surfaces in the public debate. They could've gotten a lot more hype by releasing it in a way that allowed hackers to demonstrate its capabilities.
I have the sense that you need to announce it somehow.
Why? Project Glasswing was organized semi-privately, and the announcement came last. Why couldn't the work simply have proceeded with less fanfare?
Well, it does feel useful to make public claims about the state of dangerous capabilities so that people can orient accordingly. But I'm trying to make the point here that, if you make these public claims, you should do it through independent third parties who provide irrefutable evidence.
I think you should specify which subgroup of the 'public' it is important to keep informed of, and receptive to, claims of cyber risk.
For brevity, I would like to provide a highly uncharitable characterization of the circumstances:
I assume there is some obvious flaw with this framing, and hope you may helpfully point it out.
But if it happens to be accurate, then I feel it would reflect a gross overestimation of the causal impact of Anthropic's comms on disbelief, of the ability of people to judge ostensibly irrefutable evidence of model capabilities, and of the political relevance of the particular sphere of people who mislabelled Mythos as a fundraiser.
I have seen a lot of coverage suggesting that Claude’s new model, Mythos, is a vehicle for Anthropic to peddle hype and doom in order to raise money. While some of this is surely motivated by people’s unwillingness to stare into the abyss of our AI future, the breadth of otherwise-reasonable people who have voiced these kinds of cynical opinions suggests that part of the blame rests on Anthropic.
In this post, I want to briefly unpack the way people misinterpreted the evidence, their valid reasons for doing so, and what Anthropic (and the AI safety community more broadly) should learn from this.
In particular, since we will inevitably see other dangerous capabilities spontaneously emerge in the future, we need protocols for how to announce them effectively. To this end, I try to make the following points:
1. Mythos criticisms & missing the point
In one podcast from The Guardian, the reporter says that “this Mythos debacle [has led to] heads of state saying ‘this is so dangerous, it could shred our infrastructure, the end of civilization is nigh!’”. She then says that “accepting the companies’ premise that they are creating a machine god” is helping these companies sell their product.
In a different podcast, Cal Newport (a professor of computer science at Georgetown University!) concludes his analysis of Mythos by saying “it was wrong for Mythos to get the amount of dread-coverage that it got; so far we do not have evidence that it represents a significantly larger leap in detecting or exploiting vulnerabilities than we’ve seen in previous releases.” He has since gone on other podcasts to make these points.
Similarly, the YouTube channel Internet of Bugs posted a video titled “Anthropic’s $x00 Million Marketing Stunt” saying that “we’ve been seeing a lot of this particular attention-grabbing technique lately. I’m sure it’s going to only be getting worse as the AI companies get more desperate to keep up the flow of investment dollars. How are we supposed to believe this shit?”
To be very clear, the point is not only that Anthropic has a model which is good at cybersecurity tasks. The point is that scaling laws are holding and that the inevitable acceleration continues. With each new model, we unlock new and mysterious risks which we have to grapple with. In this case, it was cybersecurity, but it could realistically have been anything. This bigger story was essentially lost amid the hoopla.
2. It feels like people wanted to miss the point?
I understand the instinct to call hype-mongering when a company says it has a big scary thing. E.g., Sam Altman’s tweet of the Death Star before GPT-5. But I am surprised at how eagerly people seize on evidence supporting their prior that Anthropic is just engaging in corporate shenanigans.
For example, there’s this post from an LLMs-for-cybersecurity org saying that other, smaller models were able to find the bugs that Mythos found. They write “we took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis.” This was retweeted by the HuggingFace CEO and, subsequently, used as evidence by all three of the above podcasts/videos.
Am I going crazy? Isn’t this pretty weak evidence for dismissing all the claims Anthropic is making? As others have said, it’s the equivalent of isolating the clump of haystack with a needle in it, giving this clump to a small child, and then saying “wow, they were able to find the needle too”. The point is that locating that clump is the hard part!
3. There must be a lesson here.
If we ignore the whole “accelerating us towards the machine god” thing, I think that Anthropic behaved responsibly with Mythos. I also think there are lessons to be learned regarding the publicity, as evidenced by the fact that reasonable people consistently missed the point.
Anthropic is in the business of making extremely unpopular things which they claim could ruin society as we know it. They also repeatedly say it is too dangerous for the public to have access to these things (I note that caution seems generally warranted).
It is reasonable for people to think Anthropic is crying wolf, especially given that they are on track to have the largest IPO of all time and got there, in part, via prophecies.
Unfortunately, I am in the camp of people who think the prophecies are true. This means that, for me, evidence which supports Anthropic’s worldview exists in superposition: it is both hype and responsible, both doomer and appropriate.
But as we’ve seen, it is difficult to convince other people that this is all really happening. They will (with good reason) consider this self-serving and they will (with good reason) not want to confront reality. If people perceive us as crying wolf, they will grow weary of our frenetic anxieties and our cause will go the way of pandemic-preparedness.
Consequently, I think of AI safety as operating with some amount of don’t-be-annoying capital[1]. This is the amount of sympathy the general public has towards our concerns. It is amassed when AI causes things to go wrong (in the public’s consciousness, not in ours). It is expended when we make claims that people don’t want to believe. It is also expended when AI companies make claims which are plausibly self-serving. It is expended even faster when these claims are poorly substantiated.
I argue we should frame public interactions around trying to grow this resource.
4. Some options for next time
Let’s put ourselves in Anthropic’s shoes: you just made your latest digital nuke. You have to do something with it. What are your options?
I have the sense that you need to announce it somehow. But, if you announce it, you likely expend capital. If the thing you announce does not live up to the hype, you expend even more capital. And we’ve seen that reasonable people will look for unreasonable excuses to dismiss your claims, draining your capital further.
To this end, Anthropic should have done more due diligence with the Mythos release: their model card (while thorough) did not have the rigor of a comprehensive scientific study. The cybersecurity assessment takes up only 6 pages of the 200+ page system card (pages 46-52), and it includes four experiments that compare only against Anthropic’s own models. The blog post goes into more detail about how they found zero-days, but it again misses some due diligence. For example, they could have evaluated other, non-Anthropic models across the capabilities spectrum to verify that Mythos is uniquely able to do these tasks. They also could have run some controls. Extraordinary claims require extraordinary evidence.
I also believe Anthropic should be more up-front about their conflict of interest with regard to making statements of doom. For as good as I believe their intentions here to be, this conflict of interest is real and people are right to perceive it. One simple option to avoid these perceptions would be to prioritize the analysis coming from independent non-profit evaluators. Credit where it’s due: Anthropic did have the UK AISI independently assess Mythos’s capabilities. Across the cynical takes I’ve seen, this analysis was treated with more respect. Of course, even this runs into cynicism, since people will start to think the non-profits are in cahoots with the companies, as evidenced by the comments on the recent NYT profile of METR.
Finally, although random dangerous capabilities will emerge, the point is not any one specific capability. I think letting the narrative get oriented towards the cybersecurity elements of Mythos did a disservice to the public: I doubt most people internalized how big the overall capabilities jump here was, nor that the next such jump will bring new harms into view.
I’m also sure there are considerations which I am not privy to, and recognize that it’s easy to criticize from my cozy corner where nothing I do moves the stock market.
Nonetheless, Anthropic’s first-mover advantage on identifying dangerous capabilities also endows them with a first-doomer responsibility.
Thanks to Erin, Steven, Justin, Joseph and Li-lian for comments.
[edit log]: changed some wording and added a sentence about the blog post from the red team. Also changed the title to be more in line with what I was trying to say; old title was "Anthropic spent too much don't-be-annoying capital on Mythos" but, in retrospect, this isn't fully representative of my view.
I call it 'don't-be-annoying capital' because that's the lived experience on the receiving end and I think it is good to think about this from the perspective of the audience. I'll admit that something like 'warning fatigue' might be a more representative name.