Josh You

data analyst at Epoch AI

@justjoshinyou13 on twitter

Comments


AI has probably increased valuations for Big Tech (particularly Nvidia) by at least a few trillion over the past two years. So part of this is that investors think OpenAI/Anthropic will only capture around 10% of total AI profits.

65T tokens doesn't get you to 1e26 FLOP with 100B active params? You'd need well over 100T tokens: 6 * 100 billion * 65 trillion is 3.9e25 FLOP.
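The arithmetic above uses the common approximation that training compute is about 6 FLOP per active parameter per token. A minimal sketch of that check follows; the function names are illustrative, and the 6x factor is a rule of thumb rather than an exact accounting of any particular model:

```python
# Rule-of-thumb training compute: ~6 FLOP per active parameter per token.
# Function names here are illustrative, not taken from any library.

def training_flop(active_params: float, tokens: float) -> float:
    """Approximate training FLOP for a given parameter count and token count."""
    return 6 * active_params * tokens


def tokens_needed(target_flop: float, active_params: float) -> float:
    """Tokens required to reach target_flop under the same approximation."""
    return target_flop / (6 * active_params)


if __name__ == "__main__":
    params = 100e9   # 100B active params (figure from the comment)
    tokens = 65e12   # 65T tokens (figure from the comment)
    print(f"{training_flop(params, tokens):.1e} FLOP")
    print(f"{tokens_needed(1e26, params) / 1e12:.0f}T tokens to reach 1e26 FLOP")
```

Under this approximation, reaching 1e26 FLOP at 100B active params takes roughly 167T tokens, which is where "well over 100T" comes from.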

GPT-4.5 being trained on fewer tokens than GPT-4o doesn't really make sense. GPT-4.5 only having 5x more active params than GPT-4o doesn't quite make sense either, though I'm not as confident that's wrong.

1e26 FLOP would have had a significant opportunity cost. Remember that OpenAI was and is very GPU constrained and may have valued GPU hours in a large-scale cluster a lot more than $2/hour. It would be worth it to make your flagship model good, but not worth it if it barely has any effect on your flagship model. I don't think it's a good idea to reason backwards from some alleged compute budget that OpenAI might have had at X date to the training FLOP of a model trained then.

I don't think GPT-4o was trained on 1e26 FLOP or particularly close to it. Overtraining is common but GPT-4o being overtrained by 10x for 1e26 FLOP is kind of a strong and surprising claim (some models like Llama 3 8b are extremely overtrained but they're small so this overtraining is cheap). I think a more natural explanation is that it improves on GPT-4 because of superior post-training and other innovations.

Josh You*40

The high cost and slow speed of GPT-4.5 seem like a sign OpenAI is facing data constraints, though we don't actually know the parameters and OpenAI might be charging a bigger margin than usual (it's a "research preview", not a flagship commercial product). If data were more abundant, wouldn't GPT-4.5 be more overtrained and have fewer parameters?

edit: FWIW Artificial Analysis measures GPT-4.5 at a not-that-bad 50 tokens per second, whereas I've been experiencing a painfully slow 10-20 tokens/second in the chat app. So it may just be growing pains until they get more inference GPUs online. But OpenAI does call it a "chonky" model, implying significant parameter scaling.

Josh You108

if OpenAI follows the usual naming convention of roughly 100x in raw compute.

I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

In Holden Karnofsky's "AI Could Defeat All Of Us Combined" a plausible existential risk threat model is described, in which a swarm of human-level AIs outmanoeuvre humans due to AI's faster cognitive speeds and improved coordination, rather than qualitative superintelligence capabilities. This scenario is predicated on the belief that "once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each." If the first AGIs are as expensive to run as o3-high (costing ~$3k/task), this threat model seems much less plausible.

I wonder how different the reasoning paradigm is, actually, from the picture presented here. After all, running a huge number of AI copies in parallel is... scaling up test-time compute. 

The overhang argument is a rough analogy anyway. I think you are invoking the intuition of replacing the AI equivalent of a very large group of typical humans with the AI equivalent of a small number of ponderous geniuses, but those analogies are going to be highly imperfect in practice.

Josh You*50

By several reports (e.g. here and here), OpenAI is throwing enormous amounts of training compute at o-series models. And if the new RL paradigm involves more decentralized training compute than the pretraining paradigm, that could lead to more consolidation into a few players, not less, because pretraining* is bottlenecked by the size of the largest cluster. E.g. OpenAI's biggest single compute cluster is similar in size to xAI's, even though OpenAI has access to much more compute overall. But if it's just about who has the most compute, then the biggest players will win.

*though pretraining will probably shift to distributed training eventually

AI systems can presumably be given at least as much access to company data as human employees at that company. So if rapidly scaling up the number and quality of human workers at a given company would be transformative, AI agents with >=human-level intelligence can also be transformative.

I think a little more explanation is required on why there isn't already a model with 5-10x* more compute than GPT-4 (which would be "4.5 level" given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up). 

You'd need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4. This has been available to the biggest hyperscalers since sometime last year. Naively it might take ~9 months from taking delivery of chips to releasing a model (perhaps 3 months to set up the cluster, 3 months for pre-training, 3 months of post-training, evaluations, etc). But most likely the engineering challenges in building a cluster that big, which is unprecedented, and perhaps high demand for inference, have prevented them from concentrating that much compute into one training run in time to release a model by now.

*I'm not totally sure the 5x threshold (1e26 FLOP) hasn't been breached but most people think it hasn't. 
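The H100 count above can be sanity-checked with a back-of-the-envelope calculation. This is only a sketch under stated assumptions: GPT-4 at about 2e25 FLOP (implied by 5x being 1e26), an H100 peak of roughly 989 TFLOP/s for dense BF16, 40% utilization, and a 90-day pre-training run. The peak throughput and utilization are my assumptions, not numbers from the comment.

```python
# Back-of-the-envelope GPU count for a 10x-GPT-4 training run.
# All constants are assumptions for illustration (see the lead-in).

GPT4_FLOP = 2e25          # implied by "5x threshold (1e26 FLOP)"
H100_PEAK_FLOP_S = 989e12 # assumed dense BF16 peak per GPU
UTILIZATION = 0.4         # assumed fraction of peak actually achieved
PRETRAIN_DAYS = 90        # "3 months for pre-training"


def gpus_needed(total_flop: float, peak_flop_per_s: float,
                utilization: float, days: float) -> float:
    """GPUs required to deliver total_flop in the given number of days."""
    seconds = days * 24 * 3600
    return total_flop / (peak_flop_per_s * utilization * seconds)


if __name__ == "__main__":
    n = gpus_needed(10 * GPT4_FLOP, H100_PEAK_FLOP_S, UTILIZATION, PRETRAIN_DAYS)
    print(f"~{n:,.0f} H100s")
```

With these inputs the estimate comes out at around 65,000 GPUs, and lower utilization or a shorter run pushes it toward 100,000, so it is in the same range as "around 100,000 H100s (or maybe somewhat fewer)".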

Josh You178

Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).
