The blog post has since been published. It contains the sentence "10x the compute of previous state-of-the-art models," which is highly misleading: the claim from the video presentation is 10x Grok 2's compute, and my estimate is that Grok 3 used about 3e26 FLOPs, or roughly 3x the compute of GPT-4o.
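For the curious, that 3e26 figure falls out of standard back-of-envelope GPU arithmetic. A minimal sketch, where the GPU count, utilization, and training duration are all my assumptions rather than anything xAI has confirmed:

```python
# Back-of-envelope training compute estimate for Grok 3.
# All inputs are assumptions: ~100k H100s (reported Colossus scale),
# ~1e15 dense BF16 FLOP/s per GPU, 40% utilization, ~90 days of training.
gpus = 100_000
flops_per_gpu = 1e15       # ~989 TFLOP/s dense BF16, rounded up
utilization = 0.40         # assumed model FLOPs utilization (MFU)
seconds = 90 * 24 * 3600   # assumed ~90-day training run

total_flops = gpus * flops_per_gpu * utilization * seconds
print(f"{total_flops:.1e} FLOPs")  # ~3.1e26, in line with the 3e26 estimate
```

Move any of those assumed inputs by 2x and the answer moves with it, so treat 3e26 as an order-of-magnitude figure, not a measurement.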
Being rushed is crucial context: there was maybe a month for post-training to produce the Chatbot Arena checkpoint. It feels smart, but has much more trouble seeing intended meaning than Claude 3.6 Sonnet, creating a need for numerous caveats before it understands. I expect this will be fixed in a couple of months, but they couldn't wait, or else it wouldn't have its SOTA moment.
Grok 3 told me 9.11 > 9.9 (common with other LLMs too), but again, turning on Thinking solves it.
This is unrelated to Grok 3, but I am not convinced that the above part of Andrej Karpathy's tweet is a "gotcha". Software version numbers use dots with a different meaning than decimal numbers, and in that context 9.11 > 9.9 would be correct.
I don't think there is a clear correct choice of which of these contexts to assume for an LLM if it only gets these few tokens.
E.g. if I ask Claude the pure "is 9.11>9.9" question, I get a no, whereas "I am trying to install a python package. Could you tell me whether `9.11>9.9`?" gets me a yes.
That title is Elon Musk’s fault, not mine. I mean, sorry not sorry:
Release the Hounds
Grok 3 is out. It mostly seems like no one cares.
I expected this, but that was because I expected Grok 3 to not be worth caring about.
Instead, no one cares for other reasons, like the rollout process being so slow (in a poll on my Twitter this afternoon, the vast majority of people hadn’t used it) and access issues and everyone being numb to another similar model and the pace of events. And because everyone is so sick of the hype.
The timing was a curious thing. Everyone including Musk worked the weekend. They released the model while it was still being trained, and when it could only be rolled out to a small group. No one has API access. There was no model card. We got only a handful of benchmarks. Elon Musk loves to talk about how other people aren’t transparent while revealing very little information himself.
There is the obvious implication that Musk wanted very badly to claim the top spot on Arena and otherwise claim that he had the ‘smartest model in the world’ during the narrow window between now and the release of the full o3 and GPT-4.5, and he knew that if OpenAI got wind of his plan too soon, or he took too long, they (or Anthropic, or someone else) might beat him to the punch.
Musk presumably wants to send the message xAI has caught up to the pack and is a top tier competitor now. I don’t quite think they’ve earned that, but this was an impressive release relative to expectations. They’re closer than I guessed.
The Expectations Game
[I locked this paragraph on 2/16]: Will Grok 3 live up to Elon’s hype? I asked this several days before release, and my presumption was no. Teortaxes said yes; John Pressman said there’s a learning curve, presumably implying that Grok 1 and 2 being unimpressive isn’t that indicative.
Did Grok 3 fully live up to Elon Musk’s promises? No, but it’s Musk. Of course it didn’t fully live up to his promises. His favorite pastime is saying that which is not via Twitter, so much so that he bought the platform. Your expectations have to adjust for this, and for the previous lousy track record of xAI in particular.
Grok 3 did very clearly exceed expectations. It exceeded my expectations, and it exceeded those of the market. It is at the top of the Arena. In my brief time with it, I’ve found it useful.
I’m not sure I’d say Elon Musk just-barely-delivered, but that’s a reasonable way of looking at it.
After release, a lot of people seem to have retconned their expectations. Of course, they said, with that many GPUs and that much willingness to spend, xAI was going to produce a temporarily close-to-SotA model. Oh, ho hum, another vaguely similarly capable model, who cares, must have been unsurprising.
I did not, and still do not, think that outcome was obvious at all. I absolutely did update positively about the competence and expected future performance of xAI. We can also modestly reduce our variance in that estimate, and our estimate of how much one can do by brute forcing via a giant supercomputer of GPUs. xAI showed it can execute at scale, but also that it probably isn’t doing much special beyond that.
Also, those who actually moved the goalposts to whether Elon’s claim of ‘smartest in the world’ was fully true? Come on. Or in some cases, ‘not AGI yet’? What?
Here’s the obvious evidence that the claim wasn’t true (the criterion here is Arena score).
I will note that Google at 1.3% seems way cheap here; if I had capital handy there I’d buy some. I realize it’s less than two weeks to go, but have you seen the leaderboard? It seems entirely plausible that an upgrade to Gemini could leapfrog Grok. Whereas Anthropic at 4% seems rich: Claude does poorly on Arena, so even if they did release a killer Sonnet 4.0 or c1, I would be unsurprised if Arena didn’t reflect that, and they probably wouldn’t test on Arena in advance anyway, so there’d be a delay in scoring.
For example, here’s Loss with a meme prediction thread. Here’s a prediction thread.
Given that Grok is #1 on Arena, it’s clearly doing a lot better than those memes.
Actual opinions on Grok 3’s place differ, as they always do, more on that later.
Man in the Arena
Grok 3 takes #1 in Arena across all categories.
As I keep saying, Arena can still help, but has obvious issues. Does anyone else think these coding or overall rankings make all that much sense in detail? I doubt it. But they do tell you important things.
The Official Benchmarks
We didn’t get many benchmarks to work with, which of course means they are selected.
Normally I’d list a bunch of other stuff here. We don’t have it.
We also don’t have a model card.
We don’t even have a blog post, at least as of me writing this sentence.
We have no indication on a wide array of things.
Who did or did not test this model? For what? Who knows!
We do know that they have a frontier model safety framework, link goes to my coverage on that, but we do not have any explicit statement that they followed it here.
This is, alas, not far from the standard set by OpenAI. They have informed us that releasing something via their $200/month Pro offering does not, for various purposes, count as a release. xAI is (I hope, implicitly) saying that whatever they’ve done does not count, either.
The Inevitable Pliny
Heart in the Wrong Place
The good news is that it wasn’t Grok 3 that was misaligned here. It was Elon Musk.
The actual Grok 3 gives a highly reasonable answer to this question, and other related questions. Indeed, when I asked Grok 3 about reaction to Grok 3, it played it straight.
I do think it is rather terrible that Elon Musk not only thinks this kind of answer would have been good, but that he thinks it is a good idea to say that out loud, with absolutely no shame. What happens when his engineers stop ignoring him on this?
Where Is Your Head At
I thought we mostly knew this already, but that it wasn’t the best way to do it?
Another note is that what they accomplished was very much not cheap. DeepSeek went all-in on compute-efficient training. xAI went all-in on scaling and moar compute. That probably means the Grok 3 model is substantially more compute-intensive to serve as well, although we cannot know: the estimate here is at least 5x the cost of Sonnet, which itself is not on the cheap end.
Beyond that, we’ll have to revisit ‘how they did it’ once the post and card are out.
Individual Reactions
Andrej Karpathy got early access for a quick vibe check. He ran Grok 3 through his standard paces, concluding that Grok 3 + Thinking is effectively a top-tier model, at a similar level to o1-pro.
I realize his shtick long ago got ridiculous but it’s still informative to know exactly what tack Gary Marcus takes with each new release.
Notice how the takes are compatible technically, but the vibes are very different.
Sully notes that he basically doesn’t know anything yet without API access.
Victor Taelin is the biggest fan I’ve seen.
Other reports were a mixed bag, with the center of the distribution seeming like ‘very good model, passes the vibe check, but mostly not the best tool out there for the job.’ At least, not this time around.
The poll reflects this, with the consensus being mildly below SotA.
At least it didn’t respond with poetry first.
Judd Rosenblatt shares a conversation with Grok 3 and concludes:
Oh no? Elon Musk is to me, at this point, a prime example of unintentional misalignment. As his capabilities have advanced and his circumstances have taken him outside his training distribution, that misalignment has become more severe, has caused more trouble, and is plausibly going to get us all into quite a bit of trouble in various ways.
Grok on Grok
I asked Grok 3 what people on Twitter thought about Grok 3.
I was very happy with the candor here. If there was one (non-political) place you’d expect a thumb on the scale, this might be it, and there wasn’t one.
I actually think this substantially underestimates Grok 3’s strengths. If its own report is to be believed, the reasoning mode is below other reasoning models, and the non-reasoning mode is worse than Sonnet or GPT-4o on a variety of metrics.
We will of course know more as Grok 3 rolls out to more people, and as they have more time to improve it. I plan to put it in ‘the rotation’ and see how it performs.
For now, xAI has proven it can throw a ton of compute at the problem, and get something reasonable out the other end, and that it is less far behind than we thought. We will see where we go from here.