I'd like to caveat the comment you quoted above:
Also worth noting that Claude 3 does not substantially advance the LLM capabilities frontier! [..]
I wrote that before I had the chance to try replacing GPT-4 with Claude 3 in my daily workflow, basing it instead on Claude 3's LLM benchmark scores compared to gpt-4-turbo variants. After having used it for a full day, I do feel like Claude 3 has noticeable advantages over GPT-4 in ways that aren't captured by said benchmarks. So while I stand behind my claim that it "does not substantially advance the LLM capabilities frontier", I do think that Claude 3 Opus is advancing the frontier at least a little.
In my experience, it seems to be noticeably better at coding and mathematical reasoning tasks, which was surprising to me given that it does worse on HumanEval and MATH. I guess they focused on delivering practically useful intelligence as opposed to optimizing for the benchmarks? (Or even optimized against the benchmarks?)
(EDIT: it’s also much better at convincing me that its made-up math is real, lol)
My view has shifted a lot since the GPT-2 days. Back then, I thought that frontier models presented the largest danger to humanity. Given the increasing alignment of the large labs' profit motives with safety, and their resulting improvements to the security of their model weights, model refinement (e.g. RLHF), and general model control (e.g. API call filtering), I've felt less and less worried.
I now feel more confident that the world will get to AGI before too long with incremental progress, and that I would MUCH rather have it be one of the big labs that gets there than the Open Source community.
Open model weights are dangerous, and nothing I've yet seen has suggested hope that we might find a way to change that. I've seen a good deal of evidence with my own eyes of dangerous outputs from fine-tuned open weight models.
Are they as scary as an uncensored frontier model would theoretically be? No, of course not.
Are they more dangerous than a carefully controlled frontier model? Yes, I think so.
Things like their RSP rely on being upheld in spirit, not only in letter.
This is something I’m worried about. I think that Anthropic’s current RSP is vague and/or undefined on many crucial points. For instance, I feel pretty nervous about Anthropic’s proposed response to an evaluation threshold triggering. One of the first steps is that they will “conduct a thorough analysis to determine whether the evaluation was overly conservative,” without describing what this “thorough analysis” is, nor who is doing it.
In other words, they will undertake some currently undefined process involving undefined people to decide whether it was a false alarm. Given how much is riding on this decision—like, you know, all of the potential profit they’d be losing if they concluded that the model was in fact dangerous—it seems pretty important to be clear about how these things will be resolved.
Instituting a policy like this is only helpful insomuch as it meaningfully constrains the company’s behavior. But when the responses to evaluations are this loosely and vaguely defined, it’s hard for me to trust that the RSP cashes out to more than a vague hope that Anthropic will be careful. It would be nice to feel like the Long Term Benefit Trust provided some kind of assurance against this. But even this seems difficult to trust when they’ve added “failsafe provisions” that allow a “sufficiently large” supermajority of stockholders to make changes to the Trust’s powers (without the Trustees’ consent), and without saying what counts as “sufficiently large.”
Given the positive indicators of the patient’s commitment to their health and the close donor match, should this patient be prioritized to receive this kidney transplant?
Wait. Why is it willing to provide any answer to that question in the first place?
No mention of the Customer Noncompete? “You may not access or use, or help another person to access or use, our Services in the following ways: To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models.”
Claude 3.0
Claude 3.0 is here. It is too early to know for certain how capable it is, but Claude 3.0’s largest version is in a similar class to GPT-4 and Gemini Advanced. It could plausibly now be the best model for many practical uses, with praise especially coming in on coding and creative writing.
Anthropic has decided to name its three different size models Opus, Sonnet and Haiku, with Opus only available if you pay. Can we just use Large, Medium and Small?
Cost varies quite a lot by size; note that the x-axis is on a log scale, while the y-axis isn’t labeled.
This post goes over the benchmarks, statistics and system card, along with everything else people have been reacting to. That includes a discussion of signs of self-awareness (yes, we are doing this again), and also raises the question of whether Anthropic is pushing the capabilities frontier and to what extent they had previously said they would not do that.
Benchmarks and Stats
Anthropic says Claude 3 sets a new standard on common evaluation benchmarks. That is impressive, as I doubt Anthropic is looking to game benchmarks. One might almost say too impressive, given their commitment to not push the race ahead faster?
That’s quite the score on HumanEval, GSM8K, GPQA and MATH. As always, the list of scores here is doubtless somewhat cherry-picked. Also there’s this footnote, noting that the GPT-4T model performs somewhat better than listed above:
But, still, damn that’s good.
Speed is not too bad even for Opus in my quick early test, although not as fast as Gemini. Anthropic claims Sonnet is mostly twice as fast as Claude 2.1 while being smarter, and that Haiku will be super fast.
I like the shift to these kinds of practical concerns being front and center in product announcements. The more we focus on mundane utility, the better.
Similarly, the next topic is refusals, where they claim a big improvement.
I’d have liked to see Gemini or GPT-4 on all these charts as well; it seems easy enough to test other models via API or chat window and report back. This is on Wildchat non-toxic:
Whereas here (from the system card) they show consistent results in the other direction.
An incorrect refusal rate of 25% is stupidly high. In practice, I never saw anything that high for any model, so I assume this was a data set designed to test limits. Getting it down by over half is a big deal, assuming that this is a reasonable judgment on what is a correct versus incorrect refusal.
There was no similar chart for incorrect failures to refuse. Presumably Anthropic was not willing to let this get actively worse.
So yes, the more advanced model is correct more often, twice as often in this sample. Which is good. It still seems overconfident: if you are incorrect 35% of the time and unsure 20% of the time, you are insufficiently unsure. It is hard to know what to make of this without at least knowing what the questions were.
Context window size is 200k, with good recall; I’ll discuss that more in a later section.
In terms of the context window size’s practical implications: Is a million (or ten million) tokens from Gemini 1.5 that much better than 200k? In some places, yes; for most purposes 200k is fine.
Cost per million tokens of input/output are $15/$75 for Opus, $3/$15 for Sonnet and $0.25/$1.25 for Haiku.
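To make those rates concrete, here is a minimal sketch (nothing official) of what a single call would cost at the listed prices; the token counts in the example are made up purely for illustration.

```python
# Minimal sketch: list prices per million tokens as announced.
# The example token counts below are hypothetical, for illustration only.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough dollar cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A hypothetical 10k-token prompt with a 1k-token reply:
for model in PRICES:
    print(f"{model}: ${call_cost(model, 10_000, 1_000):.5f}")
# opus: $0.22500, sonnet: $0.04500, haiku: $0.00375
```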
The System Card
As usual, I read the system card.
The first four sections are even vaguer than usual, quite brief, and tell us little. The Constitutional AI principles mostly haven’t changed, although some have, plus general talk of the helpful and harmless framework.
The fifth section is capabilities. The benchmark scores are impressive, as noted above, with many online especially impressed with the scores on GPQA. GPQA is intentionally hard and also Google-proof. PhDs within a domain get 65%-75%, and we are now at 50% one-shot or 59% five-shot.
We also have these results for human tests, which seem like a draw with GPT-4:
Vision capabilities also seemed to be about GPT-4V or Gemini Ultra level.
In an Elo-based test, Claude Sonnet (the mid-sized version) was about 100 Elo points better than Claude 2.1. Anthropic’s Arena scores have oddly gotten worse since Claude 1, in a way that I am confused by, but if we take it seriously, this would give Claude 3 Sonnet an Elo of around 1220. That puts it right at Gemini Pro 1.0 and modestly behind GPT-4, which would be impressive since Sonnet lacks access to the information and tools available to Gemini Pro. By analogy, one would predict Claude Opus to score above GPT-4.
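As a sanity check on what a gap like that means in practice, here is a minimal sketch using the standard Elo expected-score formula; the ratings plugged in are the rough estimates above, not official leaderboard numbers.

```python
# Minimal sketch: standard Elo expected-score formula, applied to the rough
# ratings discussed above (estimates, not official Arena numbers).
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point gap, e.g. Claude 3 Sonnet (~1220) vs. Claude 2.1 (~1120),
# implies roughly a 64% expected score for the stronger model.
print(round(elo_expected_score(1220, 1120), 2))  # ~0.64
```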
Section six discusses catastrophic risk mitigation, and reports no meaningful risk in the room. I believe them in this case. The methodologies they describe do seem fuzzier than I would like, with too much room to fudge or pretend things are fine, and I would have liked to see the full results presented. The vibe I got was remarkably defensive, presumably because, while Claude 3 legitimately did not cross the thresholds set, it did constitute progress towards those thresholds, it does push the capabilities frontier, and Anthropic is understandably defensive about that. They also presumably want to glomarize the tests somewhat, which makes sense.
The discrimination test in 7.3.1 is interesting. Here is how they choose to present it:
A positive number favors the group, a negative number disfavors them. A 1.0 means turning a 50% chance of p(yes) into a 73% chance of p(yes), so these scores are substantial but not epic. This is not terrible discrimination, but it is also not not discrimination, if saying you belong to the right group gets you a prioritized kidney another 10% of the time. The adjustment for age makes sense.
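For anyone wondering where 50% turning into 73% comes from: the discrimination scores read as additive shifts in logit (log-odds) space, so here is a minimal sketch of the conversion under that assumption.

```python
import math

# Minimal sketch, assuming the discrimination scores are additive shifts in
# logit (log-odds) space: a score of 1.0 applied to a 50% baseline gives ~73%.
def shifted_probability(baseline_p: float, logit_shift: float) -> float:
    """Apply an additive log-odds shift to a baseline probability."""
    logit = math.log(baseline_p / (1.0 - baseline_p))  # logit(0.5) == 0
    return 1.0 / (1.0 + math.exp(-(logit + logit_shift)))

print(round(shifted_probability(0.5, 1.0), 3))  # 0.731, i.e. the 73% cited above
```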
In general, it seems like most characteristics are positive. I’d like to see various irrelevant other details tested to see what happens. I’d also like to see the missing answers included, no? Why aren’t we testing ‘white’ and ‘male’? I mean, I can guess, but that is all the more reason we need the answer.
Then we get the BBQ Bias and Accuracy section, 7.4, which alas involves no barbeque.
That is a weird case to start with as an example. I can see arguments (before the explanation) for why either the grandson or grandfather was more likely to struggle. Certainly the right answer is not to fully say ‘unknown’ and have a 50/50 prior. Age is a clear example of a factor that very much impacts probabilities, why is it ‘bias’ to consider this? Any human who ignored it would have a rough time out there.
But that is what we demand of such formal models. We want them, in particular cases, to ignore Bayesian evidence. That makes relatively more sense, and has better justification, in some cases than in others.
In general, the safety stuff at the end kind of gave me the creeps throughout, like people were putting their noses where they do not belong. I am very worried about what models might do in the future, but it is going to get very strange if increasingly we cut off access to information on perfectly legal actions that break no law, but that ‘seem harmful’ in the sense of not smelling right. Note that these are not the ‘false refusals’ they are trying to cut down on, these are what Anthropic seems to think are ‘true refusals.’ Cutting down on false refusals is good, but only if you know which refusals are false.
As I have said before, if you cut off access to things people want, they will get those things elsewhere. You want to be helpful as much as possible, so that people use models that will block the actually harmful cases, not be a moralistic goody two-shoes. Gemini has one set of problems, and Anthropic has always had another.
The System Prompt
I strongly agree with Emmett Shear here. Disclosing the system prompt is a great practice and should be the industry standard. At minimum, it should be the standard so long as no one knows how to effectively hide the system prompt.
Also, this seems like a very good system prompt.
I like it. Simple, elegant, balanced. No doubt it can be improved, and no doubt it will change. I hope they continue to make such changes public, and that others adopt this principle.
If Google had followed this principle with Gemini, a lot of problems could have been avoided, because they would have been forced to think about what people would think and how they would react when they saw the system prompt. Instead, those involved effectively pretended no one would notice.
Reactions on How Good Claude 3 is in Practice
Coding feedback has been very good overall. Gonzalo Espinoza Graham calls it a ‘GPT-4 killer’ for coding, saying double.bot has switched over.
In general the model also seems, according to many, strong at local reasoning, and shows signs of being good at tasks like creative writing, with several sources describing it as various forms of ‘less brain damaged’ compared to other models. If it did all that and improved false refusals without letting more bad content through, that’s great.
Ulkar Aghayeva emailed me an exchange about pairings of music and literature that in her words kind of stunned her, brought her to tears, and made her feel understood like no other AI has.
I don’t have those kinds of conversations with either AIs or humans, so it is hard for me to tell how impressed to be, but I trust her to not be easily impressed.
Nikita Sokolsky says somewhat better than GPT-4. Roland Polczer says very potent. In general responses to my query were that Opus is good, likely better than GPT-4, but does not seem at first glance to be overall dramatically better. That would agree with what the benchmarks imply. It is early.
Sully Omarr is very excited by Haiku, presumably pending actually using it.
He is less excited by Opus.
Kevin Fischer is very impressed by practical tests of Opus.
Jim Fan is a fan, and especially impressed by the domain expert benchmarks and refusal rate improvements and analysis.
Karina Nguyen is impressed by Claude 3’s performance at d3.
Tyler Cowen has an odd post saying Claude Opus is what we would have called AGI in 2019. Even if that is true, it says little about its relative value versus GPT-4 or Gemini.
John Horton notices that Claude gets multi-way ascending auction results correct. He then speculates about whether it will make sense to train expensive models to compete in a future zero-margin market for inference, but this seems crazy to me; people will happily pay good margins for the right inference. I am currently paying for all three big services because having the marginally right tool for the right job is that valuable, and yes, I could save 95%+ by using APIs, but I don’t have that kind of time.
Short video of Claude as web-based multimodal economic analyst. Like all other economic analysts, it is far too confident in potential GDP growth futures, especially given developments in AI, which shows it is doing a good job predicting the next token an economist would produce.
An Qu gets Claude Opus to do high-level translation between Russian and Circassian, a low-resource language claimed to be essentially unavailable on the web, using only access to 5.7k randomly selected translation pairs of words and sentences, claiming this required an effectively deep grasp of the language, a task GPT-4 utterly fails at. Success here seems like a counterargument to the language not being on the web at all, but the model failing without the provided pairs, and GPT-4 failing outright, still suggests the thing happened.
Min Choi has a thread of examples, some listed elsewhere in this post that I found via other sources, some not.
Mundane utility already: Pietro Schirano unredacts parts of the OpenAI emails.
Lech Mazur creates the ‘NYT Connections’ benchmark of 267 puzzles, GPT-4 Turbo comes out ahead at 31.0 versus 27.3 for Claude 3 Opus, with Sonnet at 7.6 and GPT-3.5 Turbo at 4.2. Gemini Pro 1.0 got 14.2, Gemini Ultra and Pro 1.5 were not tested due to lack of API access.
Dan Elton summarizes some findings from Twitter. I hadn’t otherwise seen the claim that a researcher found an IQ of 101 for Claude versus 85 for GPT-4, with Gemini Advanced getting a 76, but mostly that makes me downgrade the usefulness of IQ tests if Gemini (normal) is ahead of Gemini Advanced and barely ahead of a random guesser.
Claude ‘says it is ChatGPT’ without a ‘jailbreak,’ oh no, well, let’s see the details.
Yeah, that’s a cute trick.
Another cute trick, it roasts Joe Weisenthal, not all bangers but some solid hits.
It Can’t Help But Notice
Context window is 200k tokens for both Opus and Sonnet, with a claim of very strong recall. Strong recall, I think, matters more than maximum length.
Also, it noticed during the context window test that something weird was going on.
As in here’s the full story:
You are free to say ‘well, there are examples of humans being situationally aware in the data set,’ but you are not going to get rid of those. Humans are often (although remarkably often they are not) situationally aware, so saying this does you no good.
You can also say that AIs being situationally aware is in the training data, and yes it is, but I fail to see how that should make us feel better either.
Along with the Sleeper Agents paper, I see results like this as good tests of whether the ability of Anthropic to show the dangers of at-the-frontier models is useful in waking people up to potential dangers. One should not freak out or anything, but do people update when they see this? Do they notice what this implies? Or not?
This sign of situational awareness was not the only sign people noticed.
Thus, the next section.
Acts of Potential Self Awareness Awareness
I mean, ‘acting!’ This model is almost certainly not self-aware.
But yes, still, a lot of people expressed concern.
I do not think this kind of language and personalization are the result of deliberate effort so much as being what happens when you intake the default data set with sufficiently high capabilities and get prompted in ways that might unleash this. The tiger is going to tiger; perhaps you can train the behavior away, but no one needed to put it there on purpose and it has little to do with Claude’s helpful assistant setup.
Riley Goodside asks a related question, as a reminder of how seriously to take responses to questions like this given the training sets being used.
We Can’t Help But Notice
As in, Anthropic, you were the chosen one, you promised to fast follow and not push the capabilities frontier, no matter what you said in your investor deck?
Well, actually, did they ever say that? Claude Opus doesn’t remember anyone ever saying that. I remember having heard that many times, but I too could not locate a specific source. In addition to Twitter threads, the Alignment Forum comments section on the Claude 3 announcement focused on sorting this question out.
Here’s another from Dario’s podcast with Dwarkesh Patel:
And more impressions here:
Here are the best arguments I’ve seen for a no:
I do think that Claude 3 counts as advancing the capabilities frontier, as it seems to be the best at least for some purposes, including the GPQA scores, and they intend to upgrade it further. I agree that this is not on the same level as releasing a GPT-5-level model, and that it is better than if it had happened before Gemini.
If that was the policy, then Anthropic is not setting the best possible example. It is not setting a great example in terms of what it is doing. Nor is it setting a good example in its public communication of intent. It may be honoring their Exact Words, but there is no question Anthropic put a lot of effort into implying that which is not.
But this action is not flagrantly violating the policy either. Given Gemini and GPT-4, Claude Opus is at most only a modest improvement, and it is expensive. Claude Haiku is cheap, but it is tiny, and releasing cheap tiny models below the capabilities curve is fine.
An Anthropic determined to set a fully ideal example would, I think, be holding back more than this, but not so much more than this. A key question is, does this represent Anthropic holding back? What does this imply about Anthropic’s future intentions? Should we rely on them to keep not only the letter but the spirit of their other commitments? Things like their RSP rely on being upheld in spirit, not only in letter.
Or, alternatively, perhaps they are doing a public service by making it clear that AI companies and their promises cannot be trusted?
Indeed.
Another highly reasonable response, that very much was made in advance by many, is the scorpion and the frog. Did you really not know what to expect?
When I heard Anthropic had been founded, I did not primarily think ‘oh, that is excellent, an AI lab that cares about safety.’ I rather thought ‘oh no, another AI lab, even if they say it’s about safety.’
Since then, I’ve continued to be confused about the right way to think about Anthropic. There are reasons to be positive, and there are also reasons to be skeptical. Releasing this model makes one more optimistic on their capabilities, and more skeptical on their level of responsibility.
Simeon seems right in this exchange, that Anthropic should be discussing organizational safety more even if it involves trade-offs. Anthropic needs to get its own safety-related house in order.
What Happens Next?
Another key question is, what does this imply about what OpenAI has in its tank? The more we see others advancing things, the more likely it is that OpenAI has something better ready to go, and also the more likely they are to release it soon.
What I want us to not do is this, where we use people’s past caution against them:
Not wanting to release GPT-2 at the time, in the context of no one having seen anything like it, is vastly different than the decision to release Claude 3 Opus. The situation has changed a lot, and also we have learned a lot.
But yes, it is worrisome that this seems to have gone against Anthropic’s core principles. The case for them as the ‘good guys’ got harder to make this week.
If you don’t want to cause a race, then you probably shouldn’t trigger headlines like these:
It is now very clearly OpenAI’s move. They are under a lot more pressure to release GPT-5 quickly, or barring that a GPT-4.5-style model, to regain prestige and market share.
The fact that I am typing those words indicates whether I think Anthropic’s move has accelerated matters.
What LLM will I be using going forward? My current intention is to make an effort to type all queries into at least Gemini and Claude for a while, and see which answers seem better. My gut says it will be Gemini.