Sufficiently competent code rewriting isn't implied by R1/o3, and how much better future iterations of this technique get remains unclear, just as it remains unclear how scaling pretraining with $150bn training systems cashes out in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there still won't be sufficient capabilities to self-improve in some other way.
Altman and Amodei are implying there's knowably more there in terms of some sort of scaling for test-time compute, but that could mea...
almost no difference between 180b vs 800b model, when r=1(table 4)
It's a 3B parameter model, so training it for 180B tokens already overtrains it maybe 3x, and training for 800B tokens overtrains it 13x. The loss of compute efficiency from the latter is about 1.6x more than from the former, with 4.4x more raw compute, so it should have about 2.7x more effective compute, acting like a compute optimal model that's 1.6x larger and trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800.
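A quick back-of-the-envelope check of that arithmetic (my own sketch, assuming the Chinchilla-like ~20 tokens/param anchor and the C ≈ 6·N·D rule):

```python
# Sketch: the 3B model trained on 180B vs. 800B tokens,
# assuming ~20 tokens/param as the compute optimal anchor and C ~ 6*N*D.
N = 3e9                      # active parameters
optimal_ratio = 20           # assumed compute optimal tokens/param

for D in [180e9, 800e9]:
    overtraining = (D / N) / optimal_ratio
    compute = 6 * N * D
    print(f"{D/1e9:.0f}B tokens: {overtraining:.0f}x overtrained, {compute:.1e} FLOPs")
# 180B tokens: 3x overtrained, 3.2e+21 FLOPs
# 800B tokens: 13x overtrained, 1.4e+22 FLOPs

raw_ratio = 800 / 180              # ~4.4x more raw compute
efficiency_penalty = 1.6           # extra efficiency loss of 13x vs. 3x overtraining (from the comment)
effective_ratio = raw_ratio / efficiency_penalty   # ~2.7x more effective compute
scale_factor = effective_ratio ** 0.5              # params and tokens each scale ~sqrt for compute optimal models
print(f"acts like a compute optimal model ~{scale_factor:.1f}x larger on ~{scale_factor:.1f}x more data")
# prints ~1.7x, matching the rough 1.6x estimate above
```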
I think this framing doesn't work: programs almost never control each other. Instead, they can coordinate with each other by agreeing to follow decisions of a third program that is identical between them, a "contract". Initially, the contract isn't yet "signed", so seeing each other's code sets up the conditions for defining a shared contract (deciding to follow what it'll say once computed).
There could be many contracts simultaneously, each weakly nudging decisions of multiple agents coordinated through them. Social norms are contracts in this sense. I t...
If the reasons to leave are too legible, they are either toothless, or will be gamed and become too costly to actually enforce (with the costs including injustice and drama). Trivial inconveniences that differentially apply to people who should leave anyway are still effective, but don't have these downsides.
(My own policy is to almost always avoid downvoting precisely when I have a comment to make. Otherwise the vote is all the feedback I have to give, so I'm going to give it rather than metaphorically slash their tires by staying silent and maintaining a misleading impression about the reception of their post/comment.)
It's crucial that some people get discouraged and leave for illegible reasons, without a need for hard enforcement, which has unwieldy externalities. For almost everyone who should stay, figuring out reasons for significant downvoting is probably not very difficult. Any discussion would then be about correctness or endorsement of those reasons, not about finding out what they are.
For scaling to larger training systems, the optimal tokens/param trend is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), so it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but that probably won't hold up in practice as we get to 250T and then 750T tokens within a few years, even for a dense model.
For 1:32 MoE at 5e28 FLOPs (5 GW ...
Chinchilla's 20 tokens/param (at 6e23 FLOPs) changes significantly when working with different datasets, architectures, or amounts of compute. For Llama-3-405B, it's 37 tokens/param at 4e25 FLOPs, increasing 1.5x for every 1000x of compute (Figure 3). When training on data repeated 60 times, optimal tokens/param increases about 2.5x (Figure 3).
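A rough extrapolation, just mechanically applying the 1.5x-per-1000x rule from the Llama 3 fit (the anchor values are from the comment above; treating the rule as exact is my own simplification):

```python
# Extrapolating optimal tokens/param, assuming it grows 1.5x per 1000x of compute,
# anchored at the Llama-3 fit of ~37 tokens/param at 4e25 FLOPs.
import math

def optimal_tokens_per_param(flops, anchor_flops=4e25, anchor_ratio=37, growth=1.5):
    decades_of_1000x = math.log(flops / anchor_flops, 1000)
    return anchor_ratio * growth ** decades_of_1000x

for f in [6e23, 4e25, 4e28]:
    print(f"{f:.0e} FLOPs -> ~{optimal_tokens_per_param(f):.0f} tokens/param")
# 6e23 -> ~29 (vs. Chinchilla's 20, since the ratio also depends on the dataset)
# 4e25 -> 37, 4e28 -> ~56
```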
For MoE models with 87% (1:8) sparsity, optimal tokens/param increase 3x, and at 97% (1:32) sparsity by 6x (Figure 12, left). This suggests that if Llama-3-405B was instead a MoE model with 97% sparsity, it would ha...
With 90% sparsity you do get better loss than dense, and this is sufficient to broadly carry your argument. But with 98% sparsity (your Llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum p...
And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency and don't contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute mult...
didn't run red-teaming and persuasion evals on the actually-final-version
Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.
they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks."
Ah, I failed to take note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from saturation of the benchmark and from the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn't even any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: accuracy for pass@1 is 0.45 after 2K steps, 0.55 (+0.10) a...
DeepSeek-R1 ... Run RL to convergence
Not to convergence, the graphs in the paper keep going up. Which across the analogy might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.
o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023)
It seems like o1-mini is its own thing, might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So ...
The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off.
Still, at least as long as base model effective training compute isn't scaled another 1,000x (which is 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM) rewards, which for now don't let RL scale as much as with explicitly coded verifiers.
This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
Relative to GPT-4o, which was trained at a time when 30K H100s clusters were around, and so in BF16 could be expected to be around 8e25 FLOPs, possibly overtrained to a degree that's not too different from DeepSeek-V3 itself.
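A rough sanity check on the 8e25 figure (the per-chip FLOP/s, MFU, and duration here are my own assumptions, not anything reported):

```python
# Rough sanity check: what a ~30K H100 cluster delivers over a few months of pretraining.
# Assumptions (mine): ~1e15 dense BF16 FLOP/s per H100, ~40% MFU, ~100 days.
n_gpus = 30_000
peak_flops = 1e15          # dense BF16 per H100, approximately
mfu = 0.4                  # assumed model FLOPs utilization
seconds = 100 * 86_400     # ~100 days

total = n_gpus * peak_flops * mfu * seconds
print(f"{total:.1e} FLOPs")   # ~1.0e+26, the same ballpark as the 8e25 estimate
```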
Amodei's post you linked says a few of tens of millions of dollars for Claude 3.5 Sonnet, which is maybe 4e25 FLOPs in BF16, but I think Claude 3.5 Sonnet is better than DeepSeek-V3, which is not as clearly the case for GPT-4o and DeepSeek-V3, making them easier to compare. Being better than GPT-4o at 2x fewer FLOPs, Claude 3.5 Sonnet ...
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if i...
From what I remember, the training-compute optimal number of experts was like 64
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so "number of experts" doesn't really capture the ratio of total to activated, probably not a good anchor by itself.
...Given ne
The bet that "makes sense" is that the quality of Claude 3.6 Sonnet, GPT-4o, and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (fewer active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivate...
Taken in isolation, DeepSeek-V3 looks like a 15x compute multiplier. But if a lot of it is data, the multiplier won't scale (when you need much more data, it necessarily becomes worse, or alternatively you need a teacher model that's already better). In any case, this raises the ceiling for what 5 GW training systems can do (at which point there's either almost-AGI or scaling slows down a lot). And there the 15x multiplier of DeepSeek-V3 (or what remains of it after scaling) needs to be compared with the algorithmic advancements of 2025-2028, which would've included most of the things in DeepSeek-V3 anyway, so the counterfactual impact is small.
32B active parameters instead of likely ~220B for GPT4
It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters?
32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost
Doesn't follow: training cost also scales with the number of training tokens, and in this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
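A sketch of that correction with the C ≈ 6 × (active params) × tokens rule, using the rough active-parameter guesses from the sibling comment (37B vs. maybe 280B):

```python
# Why "fewer active params => proportionally lower training cost" doesn't follow:
# training compute ~ 6 * active_params * tokens, so token counts matter too.
active_gpt4 = 280e9   # rough guess from the sibling comment (non-expert params included)
active_v3 = 37e9

for token_ratio in (1.5, 2.0):   # V3 trained on ~1.5x-2x more tokens than original GPT-4
    cost_ratio = (active_gpt4 / active_v3) / token_ratio
    print(f"V3 tokens = {token_ratio}x GPT-4 tokens -> ~{cost_ratio:.1f}x lower training compute")
# ~5.0x and ~3.8x, not the 6.8x implied by the param ratio alone
```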
The training costs are maybe 5e24 FLOPs and 2e2...
training on O1 outputs
Outputs of o1 don't include reasoning traces, so they're not particularly useful compared to outputs of chatbot models, and they're very expensive, so only a modest amount can be collected.
Imitation helps with post-training, but the compute-heavy part is pretraining, and obtaining good quality with little pretraining is a novel feat that isn't known to be explainable by good post-training, or by including a lot of outputs from good models in the pretraining/annealing mix.
This seems unlikely to be a neglected concern, unless there are specific signs that it is.
could end up being the most important thing I’ve ever written
The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.
The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of R1-Lite-Preview release:
Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren't doing it with that many fewer GPUs
DeepSeek explicitly states that
DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
This seems unlikely to be ...
Found the following in the Jan 23 newsletter:
AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, the 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
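Backing out the chip equivalences implied by those numbers (a rough conversion in "H100s worth of FLOP/s"; the per-chip ratios are approximate assumptions consistent with the figures above):

```python
# Rough FLOP/s equivalences implied above, in "H100 worth of FLOP/s" units.
# Assumed conversion factors: 1 Trn2 ~ 0.65 H100, 1 B200 ~ 2.5 H100 (dense BF16).
trn2_per_h100 = 0.65
b200_per_h100 = 2.5

print(f"400K Trn2 ~ {400_000 * trn2_per_h100 / 1e3:.0f}K H100")   # ~260K, i.e. "about 250K"
print(f"200K B200 ~ {200_000 * b200_per_h100 / 1e3:.0f}K H100")   # 500K
print(f"300K B200 ~ {300_000 * b200_per_h100 / 1e3:.0f}K H100")   # 750K
```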
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pr...
By "3rd person perspective" I mean considering the world itself, there is no actual third person needed for it. It's the same framing as used by a physicist when talking about the early stages of the universe when humans were not yet around, or when talking about a universe with alternative laws of physics, or when talking about a small system that doesn't include any humans as its part. Or when a mathematician talks about a curve on a plane.
Knowing absolutely everything is not necessary to know the relevant things, and in this case we know all the people ...
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million
Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT...
Any "perceive yourself to X" phenomenon is something that happens within cognition of some abstract agent/person instance, whether they exist in some world or not. What kind of person instance is "perceiving themselves to black out" (that is, having blacked out)? Ghosts and afterlife seem more grounded than that. But for Earth/Mars question, both options are quite clear, and there is a you that perceives either of them in some of the possibilities, we can point to where those that perceive each of them are, and that is what would be correct for those instances to conclude about themselves, that they exist in the situations that contain them, known from the statement of the thought experiment.
There is a full explanation right there, in the description of the thought experiment. It describes all outcomes, including all observations and theoretical conclusions made by all the people-instances. We can look at this and ask whether those theoretical conclusions are correct, whether the theories the people-instances use to arrive at them are valid. You can tell what all the details of outcomes are in advance of actually doing this.
Personal experience of people existing in the world is mediated by the physical states of their brains (or other physica...
One you in the worlds with total weight of 0.001 will observe remaining on Earth, while either the exact or approximate you in the worlds with total weight of 1.000 will observe arriving on Mars. That is all that actually happens.
Then they'll start making strange proclamations about their newfound epistemic states and empirical observations from the personal observation stream relevant to theories of identity, but that's beside the point.
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
This could also work for general intelligence, and not only for the narrow math/coding olympiad sort of problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities; there are no clear signs so far that this is happening.
But this is a constraint on h...
This was my understanding pre r1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with r1. It is unusually good at creative writing. It doesn't seem spikey in the way that I predicted.
I notice I am confused.
Possible explanation: r1 seems to have less restrictive 'guardrails' added using post-training. Perhaps this 'light hand at the tiller' results in not post-training it towards mode-collapse. It's closer to a raw base model than the o1 models.
This is just a hypothesis. There are many unknowns to be investigated.
it took people about 8 months to accelerate Andrej Karpathy's PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2
The baseline is weak; the 8 months is just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).
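The guessed multipliers compose multiplicatively and land near the claimed speedup (my own arithmetic on the figures above, which are themselves only rough guesses):

```python
# The individual compute multipliers compose multiplicatively.
architecture = 4.0        # maybe, from updating the architecture
tokens_per_param = 1.5    # from shifting to a more compute optimal ratio
obscure_changes = 2.0     # maybe, from the more obscure published tricks

print(architecture * tokens_per_param * obscure_changes)   # 12.0, about the claimed 14x
```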
It's much harder to improve on GPT-4 o...
There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms.
I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).
What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms.
So in the same way that Newcomblike problems are the norm, so is the "unfair" interaction with decision-making algorithms. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".
Training frontier models needs a lot of chips, so situations where "a chip notices something" (and any self-destruct type things) are unimportant, because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisanal; they need to be applied to chips in bulk, and those chips then need to be able to work for weeks in a datacenter without further interventions (beyond what can be made part of the datacenter itself).
AI accelerator chips have 80B+ transistors, much mor...
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can't be circumvented. I'm guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can't be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
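A toy sketch of why a persistent counter matters for replay (this is my own illustration of the general idea, not the actual scheme; the certificate format and key handling here are placeholders):

```python
# Toy illustration: a monotonic counter that survives resets blocks replay of
# old-but-legitimate "permission to compute" certificates.
import hmac, hashlib

SECRET = b"issuer-key"  # hypothetical key shared between issuer and chip

def issue_certificate(counter: int) -> bytes:
    """Issuer signs a permission tied to a specific counter value."""
    return hmac.new(SECRET, str(counter).encode(), hashlib.sha256).digest()

class Chip:
    def __init__(self):
        self.counter = 0  # must be persistent; if it can be rolled back, replay works

    def accept(self, cert: bytes, counter: int) -> bool:
        expected = hmac.new(SECRET, str(counter).encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(cert, expected):
            return False          # forged certificate
        if counter <= self.counter:
            return False          # old certificate replayed
        self.counter = counter    # advance persistent state
        return True

chip = Chip()
cert1 = issue_certificate(1)
assert chip.accept(cert1, 1)      # fresh certificate works
assert not chip.accept(cert1, 1)  # replaying it fails, as long as the counter can't be reset
```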
for example Zvi insisting that anyone who is not using LLMs to 10x their productivity is not serious ... a vibe not a direct quote
I expect he'd disagree, for example I vaguely recall him mentioning that LLMs are not useful in a productivity-changing way for his own work. And 10x specifically seems clearly too high for most things even where LLMs are very useful, other bottlenecks will dominate before that happens.
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
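A quick check of that trade with C ≈ 6·N·D (my own arithmetic on the stated factors):

```python
# Checking the overtraining trade with C ~ 6*N*D: 10x the data, 1/3 the active params.
N, D = 1.0, 1.0               # compute optimal baseline (arbitrary units)
N2, D2 = N / 3, 10 * D        # the overtrained variant

compute_ratio = (6 * N2 * D2) / (6 * N * D)   # ~3.3x more compute
overtraining = (D2 / N2) / (D / N)            # tokens/param grows 30x
print(f"{compute_ratio:.1f}x more compute, {overtraining:.0f}x overtrained")
```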
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately co...
The tweet links to the 3 Feb 2025 OpenAI paper that discusses specialized o1-ioi system based on o1 that competed live during IOI 2024, and compares its performance to later results with o3.
I think the most it says about the nature of the distinction between o1 and o3 is this (referring to results of o3):
This suggests that o3 is based on the same base model, or even a shared RL checkpoint, but still ambiguously. So doesn't clearly rule o...