Paging Gwern or anyone else who can shed light on the current state of the AI market—I have several questions.

Since the release of ChatGPT, at least 17 companies, according to the LMSYS Chatbot Arena Leaderboard, have developed AI models that outperform it. These companies include Anthropic, NexusFlow, Microsoft, Mistral, Alibaba, Hugging Face, Google, Reka AI, Cohere, Meta, 01 AI, AI21 Labs, Zhipu AI, Nvidia, DeepSeek, and xAI.

Since GPT-4’s launch, 15 different companies have reportedly created AI models that are smarter than GPT-4. Among them are Reka AI, Meta, AI21 Labs, DeepSeek AI, Anthropic, Alibaba, Zhipu, Google, Cohere, Nvidia, 01 AI, NexusFlow, Mistral, and xAI.

Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources, has somehow built the third smartest AI in the world, apparently on par with the very best from OpenAI.

The top AI image generator, Flux AI, which is considered superior to the offerings from OpenAI and Google, has no Wikipedia page, barely any information available online, and seemingly almost no employees. The next best in class, Midjourney and Stable Diffusion, also operate with surprisingly small teams and limited resources.

I have to admit, I find this all quite confusing.

I expected companies with significant experience and investment in AI to be miles ahead of the competition. I also assumed that any new competitors would be well-funded and dedicated to catching up with the established leaders.

Understanding these dynamics seems important because they influence the merits of things like a potential pause in AI development or the ability of China to outcompete the USA in AI. Moreover, as someone with general market interests, the valuations of some of these companies seem potentially quite off.

So here are my questions:

1. Are the historically leading AI organizations—OpenAI, Anthropic, and Google—holding back their best models, making it appear as though there’s more parity in the market than there actually is?
  
2. Is this apparent parity due to a mass exodus of employees from OpenAI, Anthropic, and Google to other companies, resulting in the diffusion of "secret sauce" ideas across the industry?

3. Does this parity exist because other companies are simply piggybacking on Meta's open-source AI model, which was made possible by Meta's massive compute resources? Now, by fine-tuning this model, can other companies quickly create models comparable to the best?

4. Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?

5. Are AI image generators just really simple to develop but lack substantial economic reward, leading large companies to invest minimal resources into them?

6. Could it be that legal challenges in building AI are so significant that big companies are hesitant to fully invest, making it appear as if smaller companies are outperforming them?

7. And finally, why is OpenAI so valuable if it’s apparently so easy for other companies to build comparable tech? Conversely, why are these no name companies making leading LLMs not valued higher?

Of course, the answer is likely a mix of the factors mentioned above, but it would be very helpful if someone could clearly explain the structures affecting the dynamics highlighted here.


5 Answers

Nathan Helm-Burger


1. Are the historically leading AI organizations—OpenAI, Anthropic, and Google—holding back their best models, making it appear as though there’s more parity in the market than there actually is?
  

4. Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?

I'll try to address just these two points with what I know, which is limited.

OpenAI, Anthropic, Google / Google Deepmind, Meta, AWS, and NVIDIA are the main current holders of AI compute hardware. Of these, the majority of frontier engineering talent (and expressed intent to race for AGI) seems to be concentrated in OpenAI, Anthropic, and Google / Google Deepmind. I think of these as a group I call 'the Top 3'.

So far as the public knows, the Top 3 are working hard on preparing the next generation of LLMs (now all multimodal, so LLM is a bit of a misnomer). Preparing a next generation takes time and effort from the researchers, plus engineering support and training time on the large clusters. We are roughly halfway through the expected interval between full-step versions (e.g. GPT-4 to GPT-5). Of these companies, my best guess is that OpenAI has a slight lead, and thus will probably deploy their next gen (i.e. GPT-5) before the other companies deploy theirs (Claude 4 Opus, Google Gemini 2 Ultra). The time lag may be a couple of weeks or as much as six months. Hard to say for sure. Google has a lot of resources, and so might slightly beat OpenAI.

In any case, the big differentiator in terms of the Top 3 versus the Rest is the massive amount of hardware. GPT-4 level was something that could be trained on a wide variety of different rentable resources. The scaling trend suggests that the next level up, GPT-5, will require more resources than the Rest are expected to be able to muster this year. Since hardware is advancing, some members of the Rest may acquire GPT-5 level resources late next year (2025) or early 2026. This means they'll be quite a few months behind. As for the implied resources needed for GPT-6 level if scaling trends and costs continue on trend, it seems unlikely that most of the Rest will be able to afford to scale that large (even acting late, and with rented resources) with the exception of Meta and possibly xAI.

Current opinion, which I agree with, puts Claude 3.5 Sonnet at about 4.25 - 4.3 level compared to GPT-4. All of the others (including GPT-4o) fall somewhere in-between level 4 and level 4.3

Nobody has a level 5 yet because even the leaders haven't had the time to create and deploy it yet!

As for "secret sauce" ideas, we haven't yet gotten public knowledge about any secret knowledge being a hard blocker. It does seem like there is a fair amount of small technical secrets which improve compute efficiency or capabilities in minor ways, but that these can all be compensated for by spending more on compute or coming up with alternate approaches. There is a huge amount of ML research being published every month now because the field has gotten so lucrative and trendy. The newest public research isn't yet present in the deployed models because the models were trained and deployed before the new research was published!

This makes for a strange dynamic that those companies which are lagging slightly behind in getting their large training runs started get the advantage of later cut-off date for incorporation of public research. I think that getting to incorporate more of the latest research is a significant part of the explanation for why the GPT-4-sized models of late-movers have slightly surpassed GPT-4.

This doesn't imply that OpenAI is losing the race, or that they don't have valuable technical secrets. The Top 3 are still in the lead, so far as I can foresee, they just are in the 'hidden progress' phase which comes between model generations. Because of this, we can't know their relative standing for certain. Presumably, even they don't know since they don't have the details on the secret tech that their competitors are putting into their next generation. We will need to wait and see how the next generation of the Top 3's models compare to each other.

Unclear if going beyond GPT-5 will be crucial, at that point researchers might get more relevant than compute again. GPT-4 level models (especially the newer ones) have the capability to understand complicated non-specialized text (now I can be certain some of my more obscure comments are Objectively Understandable), so GPT-5 level models will understand very robustly. If this is sufficient signal to get RL-like things off the ground (automating most labeling with superhuman quality, usefully scaling post-training to the level of pre-training), more scale won't necessarily help on the currently-somewhat-routine pre-training side.

I think a little more explanation is required on why there isn't already a model with 5-10x* more compute than GPT-4 (which would be "4.5 level" given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up). 
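The version-number convention used above (one full GPT version per two orders of magnitude of training compute) can be written as a one-line formula. This is only a rough sketch: the function name `gpt_level` and the choice of GPT-4 as the reference point are illustrative assumptions, not an official scale.

```python
import math


def gpt_level(compute_ratio, base_level=4.0, ooms_per_version=2.0):
    """Map a training-compute multiple (relative to GPT-4) to an informal 'GPT level'.

    Assumes the convention from the comment: +1 version per two orders of
    magnitude (OOMs) of compute. Hypothetical helper, not an established metric.
    """
    if compute_ratio <= 0:
        raise ValueError("compute_ratio must be positive")
    return base_level + math.log10(compute_ratio) / ooms_per_version


for ratio in (5, 10, 100):
    print(f"{ratio}x GPT-4 compute -> level {gpt_level(ratio):.2f}")
```

Under this convention a 10x scale-up lands at level 4.5 and a 100x scale-up at level 5.0, which is why a model with 5-10x GPT-4's compute is described as roughly "4.5 level".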

You'd need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4. This has been available to the biggest hyperscalers since sometime last year. Naively it might...

GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give 300 teraFLOP/s, TPUv4s 270 teraFLOP/s, H100s 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that's for sparse computation that isn't relevant for training). So H100s have 3x advantage over hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405b was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.

With 100K H100s, 1 month at 30% utilization gets you 8e25 FLOPs. OpenAI might have obtained this kind of training compute in May 2024, and xAI might get it at the end of 2024. AWS announced access to clusters with 20K H100s back in July 2023, which is 2e25 FLOPs a month at 40% utilization.
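The arithmetic behind these estimates is just GPUs × peak throughput × utilization × time. Here is a minimal sketch that reproduces the figures quoted above, assuming a 30-day month and the 1000 teraFLOP/s dense BF16 figure for H100s given in the comment; the function and constant names are my own.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month
H100_BF16_FLOPS = 1000e12           # dense BF16 FLOP/s per H100, as quoted in the comment


def training_flops(n_gpus, utilization, months, peak_flops_per_gpu=H100_BF16_FLOPS):
    """Total training FLOPs for a run at a given sustained compute utilization."""
    return n_gpus * peak_flops_per_gpu * utilization * months * SECONDS_PER_MONTH


# (label, GPU count, utilization, months) taken from the scenarios in the comment
scenarios = [
    ("100K H100s, 1 month, 30% utilization", 100_000, 0.30, 1),
    ("20K H100s (AWS), 1 month, 40% utilization", 20_000, 0.40, 1),
    ("16K H100s (Llama-3-405b), 2 months, 40% utilization", 16_000, 0.40, 2),
]

for label, n_gpus, utilization, months in scenarios:
    print(f"{label}: {training_flops(n_gpus, utilization, months):.1e} FLOPs")
```

The first two scenarios come out at roughly 8e25 and 2e25 FLOPs, matching the comment. The Llama scenario comes out somewhat below the quoted 4e25, which is consistent with "about 2 months" being a rounded duration.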

So assuming AWS's offer is real for the purpose of training a single model on the whole 20K H100s cluster and was sufficiently liquid, for a year now 6 months of training could have yielded a 1.2e26 FLOPs model, which is 6x GPT-4, 3x Llama-3-405b, or on par with Gemini 1.0 Ultra. But much more than that wasn't yet possible, not without running multiple such clu...

6Nathan Helm-Burger
Yes, good point Josh. If the biggest labs had been pushing as fast as possible, they could have a next model by now. I don't have a definite answer to this, but I have some guesses. It could be a combination of any of these.

  • Keeping up with inference demand, as Josh mentioned.
  • Wanting to focus on things other than getting the next big model out ASAP: multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), non-capabilities work like safety or watermarking.
  • Choosing to put more time and effort into improving the data/code/training process which will be used for the next large model run. Potentially including: smaller scale experiments to test ideas, cleaning data, improving synthetic data generation (strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), developing and testing better engineering infrastructure to support larger runs.
  • Wanting to spend extra time evaluating performance of the checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly. Mistakes caught early in the training process are less costly overall.
  • Wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways. Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new larger generation of models the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we'd expect that much additional delay. We would only see a new next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
  • if safety and quality testing is done (as I expect it will be), then flaws fo

habryka


The release of Llama 405b was the thing that most succinctly explained this to me. At least when it comes to the current generation of cutting edge LLMs, there is no secret sauce. Llama 405b is a cutting edge model with, as far as I can tell, no advances in architecture or training compared to the development of GPT-3. Indeed, it appears in architecture substantially simpler than GPT-4 while outperforming it, suggesting that in the long-run, simplicity of architecture tends to win out, especially if you are willing to take a relatively small (<3x) compute-cost hit.

The architecture is a straightforward transformer with no mixture of experts or anything fancy.

The training process did nothing interesting. It used the most obvious implementation of supervised fine-tuning and reinforcement training. 

The data cleaning process was somewhat more involved, and we know less about it, but I think it is unlikely to have involved anything like synthetic data generation or complicated AI-assisted review.

This might all again change with the next generation of LLMs (especially with things like Strawberry, which looks like it might do something more interesting), but at least right now, I think almost any competent engineering team in the world could build a cutting-edge AI model, if they were just willing to spend the compute. It requires overcoming some minor engineering challenges, but the basics of how to do this are figured out. There is no moat.

Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).

@ryan_greenblatt: Curious if you have a quick example of an architectural change from GPT-3. Quick googling/perplexing maybe suggests some changes in the attention algorithm (grouped-query attention instead of whatever GPT-3 was doing). 

I was trying to just highlight "training" rather than architecture. I think there are architecture changes (SwiGLU, grouped-query attention, probably somewhat better tuned transformer hparams like layer count etc.) though these are perhaps minor.
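For readers unfamiliar with grouped-query attention: in standard multi-head attention every query head has its own key/value head, whereas in grouped-query attention several query heads share one key/value head, which shrinks the key/value cache. The sketch below shows only the head-sharing bookkeeping; the head counts are made-up illustrative values, not the actual configuration of any model discussed here.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Return the index of the key/value head that a given query head attends with."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads sharing one K/V head
    return q_head // group_size


def kv_cache_fraction(n_q_heads, n_kv_heads):
    """K/V cache size relative to multi-head attention with one K/V head per query head."""
    return n_kv_heads / n_q_heads


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
N_Q_HEADS, N_KV_HEADS = 8, 2
mapping = [kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]
print("query head -> K/V head:", mapping)
print("K/V cache relative to multi-head attention:", kv_cache_fraction(N_Q_HEADS, N_KV_HEADS))
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention, and with `n_kv_heads == 1` to multi-query attention, so the change is a small knob rather than a new architecture.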

My understanding of the key training advances relative to GPT3:

  • Closer to chinchilla optimal via having enough data. (I think 405b is 2x too much data according to chinchilla while GPT3 is 8x too little data.)
  • Better data. The paper says "Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training."

I think 405b is 2x too much data according to chinchilla while GPT3 is 8x too little data

They did the Chinchilla scaling experiments themselves, it's in the report (Section 3.2.1 Scaling Laws). The result claims that 40 tokens/parameter is actually optimal in their setup (2x more than in the Chinchilla paper), so Llama-3-405b is Chinchilla optimal in the relevant sense, it's not trained on too much data. The result is slightly suspicious in that their largest datapoints are 1e22 FLOPs, while Llama-3-405b itself is 4e25 FLOPs, so that's a lot of extrapolation. But overall they find that the optimal tokens/parameter ratio increases with compute, more so than in the Chinchilla paper, and Llama-3-405b had more compute than Chinchilla.

This is also consistent with the CARBS experiments done by Imbue (search for "tokens per parameter"):

Another interesting finding is the optimal number of tokens per parameter. We found this optimal number to be slightly increasing across our range of experiments (see the dashed black line). Note that our methodology differed from that of Chinchilla in a few significant ways: we explicitly scaled the number of machines together with the model size, effectively changing the batch size.
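To make the tokens-per-parameter discussion concrete, here is a small sketch that splits a fixed compute budget between parameters and tokens. It assumes the common approximation that training compute is about 6 × parameters × tokens; the function name is hypothetical and the 4e25 FLOPs budget is the Llama-3-405b figure quoted earlier in the thread.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param):
    """Split a FLOP budget into (params, tokens) for a target tokens/parameter ratio.

    Assumes C ~= 6 * N * D, so with D = r * N we get N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


BUDGET = 4e25  # FLOPs, the Llama-3-405b estimate quoted in this thread

for ratio in (20, 40):
    n_params, n_tokens = compute_optimal_split(BUDGET, ratio)
    print(f"{ratio} tokens/param: {n_params / 1e9:.0f}B params, {n_tokens / 1e12:.1f}T tokens")
```

At 40 tokens per parameter the budget works out to a model of a little over 400B parameters, close to Llama-3-405b's size. At the original Chinchilla ratio of 20 the same budget would favor a larger model trained on fewer tokens, which is where the "2x too much data" reading comes from.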

6habryka
Ah, sorry, yeah, I basically agree with this. I do think the scaling law stuff made a big difference. I commented a bit on the training data stuff, but my best guess is the changes there are also minor (besides the sheer volume).
leogao

Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.

Also, in general I start from a prior of being skeptical of papers claiming their models are comparable/better than GPT-4. It's very easy to mislead with statistics - for example, human preference comparisons depend very heavily on the task distribution, and how discerning the raters are. I have not specifically looked deeply into Llama 405B though.

6habryka
That's true, though I do think there are various proxies that make at least the extreme end of this kind of thing for currently deployed models relatively easy to rule out (like the compute-purchase and allocation decisions of major cloud providers who host some of these models, and staff allocation and various other things). I do think most organizations who claim parity with GPT-4 or Sonnet are almost always overstating things. My experience with 405b suggests it is also not at the level of Claude 3.5 Sonnet, but it does seem to be at the level of the original GPT-4, though I am not confident since I haven't played around that much with GPT-4 recently.

Yeah, I mostly agree. I would say that there may or may not be certain secret techniques which will give models a slightly lower loss plateau for a given parameter count. That matters more to the large companies than compute efficiency, I think. 

Accumulate enough loss-plateau-lowering tidbits, and it could add up to having the best model out of a group of similarly sized models.

elifland


Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources

Both of these seem false.

Re: talent, see from their website:

They don't list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.

Re: resources, according to Elon's early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, according to SemiAnalysis). And xAI was working on a 100k H100 cluster that was on track to be finished in July. Also they raised $6B in May.

And xAI was working on a 100k H100 cluster that was on track to be finished in July.

According to DCD, that should be fall 2025. Planned power is 150 megawatts or possibly 50+150 megawatts, which is good for 100K H100s, but not more than that. The request for the 150 megawatts is still being discussed by the utilities, as of August 2024. Any future Blackwells will need to go elsewhere, the whole plan for this datacenter seems to be the 100K H100s. (This costs about $5bn, and xAI only closed its $6bn Series B in May 2024.)

according to Elon ... Grok 2 wa

...
9ryan_greenblatt
Unless Elon is lying, it was operational as of July, though perhaps only with about 32k of the H100s rather than all of them. My understanding is that at least 64k are operational now. Yes, though mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.
6ryan_greenblatt
Seems to be fully online as of now (Sep. 2) based on this tweet?
6ryan_greenblatt
I now think this is false. From The Information:
9gwern
Keep in mind Musk never said it was "fully online" or "100,000 GPUs are running concurrently" or anything like that. He only said that the cluster was "online", which could mean just about anything, and that it is "the most powerful AI training system", which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery ("best pizza in the world!"). If you fell for it, well, then the tweet was for you.
2Vladimir_Nesov
I wonder if it's all running on generators, and what this means about Grok-3. With 30K H100s, 1.5 months only get 4e25 FLOPs, the Llama-3 compute. I'm guessing they'd want 1e26 FLOPs or so to get a meaningful improvement over Grok-2, which is 2 more months. But in 2 months, 100K H100s give 1.6e26 FLOPs (I'm assuming slightly worse utilization). Maybe figuring out how to be efficient with including more compute into a run that has already started is part of the plan, so that in a few more months the mentioned scaleup to further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.
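The "how many more months" reasoning above can be checked by inverting the usual compute estimate. This is a rough sketch assuming a 30-day month, 1000 teraFLOP/s dense BF16 per H100, and 40% utilization by default; none of these are confirmed numbers for xAI's actual runs, and the function name is my own.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def months_to_reach(target_flops, n_gpus, utilization=0.4, peak_flops_per_gpu=1000e12):
    """Months of training needed to accumulate target_flops on a fixed cluster."""
    flops_per_month = n_gpus * peak_flops_per_gpu * utilization * SECONDS_PER_MONTH
    return target_flops / flops_per_month


total = months_to_reach(1e26, n_gpus=30_000)
print(f"30K H100s to 1e26 FLOPs: {total:.1f} months total, "
      f"{total - 1.5:.1f} more after the first 1.5 months")

big_cluster = months_to_reach(1.6e26, n_gpus=100_000, utilization=0.3)
print(f"100K H100s to 1.6e26 FLOPs at 30% utilization: {big_cluster:.1f} months")
```

Under these assumptions the 30K cluster needs a bit over three months in total to reach 1e26 FLOPs, i.e. roughly two more months after the first 1.5, while the full 100K cluster would get to 1.6e26 FLOPs in about two months.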
5Vladimir_Nesov
Memphis datacenter might be operational in some form, but the 100K H100s cluster is not operational, and I was responding to elifland's specific claim about "a 100k H100 cluster that was on track to be finished in July". The point is, the scale that's beyond what you can get from AWS is not going to be available for some time. This is a point journalists repeatedly got wrong, what is claimed is that something is operational in July, and that the datacenter is planned to have 100K H100s, but it doesn't follow that 100K H100s are operational in July. By analogy with Llama-3-405b, Grok-2 started training no later than Mar-Apr 2024 (it needs to finish pre-training, and then go through RLHF), so it wasn't trained using the Memphis datacenter. And in its current state, the Memphis datacenter won't significantly improve on that scale, the bulk of the improvement would need to come from training for more months. If by the end of 2024, both 100K H100s and the 150 megawatts substation are ready, then xAI will start to catch up with OpenAI, which might already be training at that scale since May. So Grok-3 is probably using these 30K H100s instead of rented compute like Grok-2. This seems to be a wash in terms of scale, more a way of keeping the 30K H100s in use and getting experience for the subsequent 100K run. Targeting end of 2024 for Grok-3 release means it finishes pre-training in late 2024, maybe Oct-Nov 2024 (leaving some time for RLHF until end of 2024), so this is some evidence for end of 2024 as the time when 100K H100s get online, otherwise Grok-3 could be trained for longer. As it is, it's going to get about 1e26 FLOPs. Since Grok-1 was MoE (unlike Llama-3-405b), this has a chance of being better than current SOTA as of Aug 2024, but by the end of 2024 there might already be Claude 3.5 Opus or a new Gemini.

James Camacho

  2. Is this apparent parity due to a mass exodus of employees from OpenAI, Anthropic, and Google to other companies, resulting in the diffusion of "secret sauce" ideas across the industry?

No. There isn't much "secret sauce", and these companies never had a large amount of AI talent to begin with. Their advantage is being in a position with hype/reputation/size to get to market faster. It takes several months to set up the infrastructure (getting money, data, and compute clusters), but that's really the only hurdle.

  3. Does this parity exist because other companies are simply piggybacking on Meta's open-source AI model, which was made possible by Meta's massive compute resources? Now, by fine-tuning this model, can other companies quickly create models comparable to the best?

No. "Everyone" in the AI research community knew how to build Llama, multi-modal models, or video diffusion models a year before they came out. They just didn't have $10M to throw around.

Also, fine-tuning isn't really the way to go. I can imagine people using it as a teacher during the warming up phase, but the coding infrastructure doesn't really exist to fine-tune or integrate another model as part of a larger one. It's usually easier to just spend the extra time securing money and training.

  4. Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?

Yep. Even five years ago you could open a Colab notebook and train a language translation model in a couple of minutes.

  5. Are AI image generators just really simple to develop but lack substantial economic reward, leading large companies to invest minimal resources into them?

No, images are much harder than language. With language models, you can exactly model the output distribution, while the space of images is continuous and much too large for that. Instead, the best models measure the probability flow (e.g. diffusion/normalizing flows/flow-matching), and follow it towards high-probability images. However, parts of images should be discrete. You know humans have five fingers, or text has words in it, but flows assume your probabilities are continuous.

Imagine you have a distribution that looks like

__|_|_|__

A flow will round out those spikes into something closer to

_/^\/^\/^\__

which is why gibberish text or four-and-a-half fingers appear. In video models, this leads to dogs spawning and disappearing into the pack.
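A toy numerical version of the picture above: take three discrete spikes and replace them with a smooth density, here a Gaussian mixture. This is only a loose stand-in for what a continuous flow model learns, not an implementation of diffusion or flow matching, and the spike positions and width are arbitrary illustrative values. The point is that the smooth density assigns real probability to the gaps between spikes, where the true distribution has none.

```python
import math


def smoothed_density(x, spikes, sigma):
    """Density at x of an equal-weight Gaussian mixture centred on the spikes."""
    norm = 1.0 / (len(spikes) * sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in spikes)


SPIKES = [1.0, 2.0, 3.0]  # the only values the true (discrete) distribution allows
SIGMA = 0.25              # illustrative amount of smoothing

at_spike = smoothed_density(2.0, SPIKES, SIGMA)
between_spikes = smoothed_density(1.5, SPIKES, SIGMA)

print(f"density at a valid value (x=2.0):      {at_spike:.3f}")
print(f"density between valid values (x=1.5):  {between_spikes:.3f}")
print(f"ratio: {between_spikes / at_spike:.2f}")
```

The in-between value is less likely than a valid one but far from impossible, so samples occasionally land there. That is the numerical analogue of four-and-a-half fingers.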

  6. Could it be that legal challenges in building AI are so significant that big companies are hesitant to fully invest, making it appear as if smaller companies are outperforming them?

Partly when it comes to image/video models, but this isn't a huge factor.

  7. And finally, why is OpenAI so valuable if it’s apparently so easy for other companies to build comparable tech? Conversely, why are these no name companies making leading LLMs not valued higher?

I think it's because AI is a winner-takes-all competition. It's extremely easy for customers to switch, so they all go to the best model. Since ClosedAI already has funding, compute, and infrastructure, it's risky to compete against them unless you have a new kind of model (e.g. LiquidAI), reputation (e.g. Anthropic), or are a billionaire's pet project (e.g. xAI).

Without a doubt, the question is very interesting. As it stands, something doesn't seem to fit, so it is worth looking at from a different angle: this may not be a race to be first to AGI. It's possible that the costs of training the new models currently in the oven are simply too high. Investors are thrilled to be able to say that they were the first to reach their goal, but don't be fooled; their main job is to make sure they get back everything they put in. If we put all of these expected costs into one equation, it's clear that the return has to be large in the short and medium term for this to be even a moderately good investment. The truth is that the Top 3's sales of these models today are very low. From this point of view, all of the big companies mentioned in the post should be working hard to find a way to recoup their investments.


The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?