All of Vladimir_Nesov's Comments + Replies

The tweet links to the 3 Feb 2025 OpenAI paper that discusses the specialized o1-ioi system based on o1 that competed live during IOI 2024, and compares its performance to later results with o3.

I think the most it says about the nature of the distinction between o1 and o3 is this (referring to results of o3):

As shown in Figure 5, further RL training provided a significant improvement over both o1 and the full o1-ioi system.

This suggests that o3 is based on the same base model, or even a shared RL checkpoint, but only ambiguously. So it doesn't clearly rule o... (read more)

Sufficiently competent code rewriting isn't implied by R1/o3, and how much better future iterations of this technique get remains unclear, similarly to how it remains unclear how scaling pretraining using $150bn training systems cashes out in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there won't yet be sufficient capabilities to self-improve in some other way.

Altman and Amodei are implying there's knowably more there in terms of some sort of scaling for test-time compute, but that could mea... (read more)

almost no difference between 180b vs 800b model, when r=1 (table 4)

It's a 3B parameter model, so training it for 180B tokens already overtrains it maybe 3x, and training for 800B tokens overtrains it 13x. The loss of compute efficiency from the latter is about 1.6x more than from the former, with 4.4x more raw compute, so it should have 2.7x more in effective compute, or act like a compute optimal model that's 1.6x larger, trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800.
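A back-of-the-envelope sketch of that arithmetic (assuming a Chinchilla-style ~20 tokens/param optimum, and taking the ~1.6x relative efficiency loss as given rather than deriving it):

```python
# Effective-compute comparison for a 3B-param model trained on 180B vs 800B tokens.
params = 3e9
optimal_tokens = 20 * params               # ~60B tokens for compute-optimal training

overtrain_180 = 180e9 / optimal_tokens     # ~3x overtrained
overtrain_800 = 800e9 / optimal_tokens     # ~13x overtrained

raw_compute_ratio = 800e9 / 180e9          # ~4.4x more raw compute for the 800B run
relative_efficiency_loss = 1.6             # assumption: taken from the estimate above

effective_compute_ratio = raw_compute_ratio / relative_efficiency_loss   # ~2.7x

# A compute-optimal model with k^2 times more compute is ~k times larger and
# trained on ~k times more tokens, so 2.7x effective compute is ~1.6x on each axis.
k = effective_compute_ratio ** 0.5
print(overtrain_180, overtrain_800, raw_compute_ratio, effective_compute_ratio, k)
```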

I think this framing doesn't work: programs almost never control each other. Instead they can coordinate with each other by agreeing to follow the decisions of a third program, which is identical between them, a "contract". Initially, the contract isn't yet "signed", so seeing each other's code sets up the conditions for defining a shared contract (deciding to follow what it'll say once computed).
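A toy sketch of the idea (hypothetical and illustrative only, not anyone's actual formalism): neither agent controls the other, but each computes the same third program from the pair of source codes and commits to following its verdict.

```python
import inspect

def contract(code_a: str, code_b: str) -> str:
    """The shared 'contract': a deterministic program both agents compute identically.
    Toy rule: cooperate iff both parties visibly defer to the contract."""
    if "contract(" in code_a and "contract(" in code_b:
        return "cooperate"
    return "defect"

def agent(own_code: str, other_code: str) -> str:
    # 'Signing' the contract: compute it in a canonical way (sorting the arguments
    # makes the computation identical for both agents) and do whatever it says.
    return contract(*sorted([own_code, other_code]))

src = inspect.getsource(agent)
print(agent(src, src))   # two copies of this agent end up cooperating
```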

There could be many contracts simultaneously, each weakly nudging decisions of multiple agents coordinated through them. Social norms are contracts in this sense. I t... (read more)

Whether upvotes need to be explained overall is not relevant to my comment, as I'm talking about the specific considerations named by Noah Birnbaum.

It's not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.

1Davey Morse
that's one path to RSI—where the improvement is happening to the (language) model itself. the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn't be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it. Such a self-improving codebase... would it be reasonable to call this an agent?

If the reasons to leave are too legible, they are either toothless or will be gamed and become too costly to actually enforce, including in injustice and drama. Trivial inconveniences that differentially apply to people that should leave anyway are still effective, but don't have these downsides.

(My own policy is to almost always avoid downvoting precisely when I have a comment to make. Otherwise the vote is all the feedback I have to give, so I'm going to give it rather than metaphorically slash their tires by staying silent and maintaining a misleading impression about the reception of their post/comment.)

These considerations also apply to upvotes (to the extent that they do).

4KvmanThinking
I don't think such considerations apply to upvotes nearly as much if at all. Upvotes indicate agreement or approval, which doesn't need to be explained as thoroughly as disagreement (which usually involves having separate, alternative ideas in your head different from the ideas of the one you are disagreeing with)

It's crucial that some people get discouraged and leave for illegible reasons, without a need for hard enforcement, which has unwieldy externalities. For almost everyone who should stay, figuring out reasons for significant downvoting is probably not very difficult. Any discussion would then be about correctness or endorsement of those reasons, not about finding out what they are.

0Seth Herd
I think you're overestimating how difficult it is for one person to guess another's thoughts. Good writing is largely a challenge of understanding different perspectives. It is hard. I'm curious why you think it's crucial for people to leave for illegible reasons in particular? I do see the need to keep the community to a good standard of average quality of contributions.
1FlorianH
Interesting. Can you elaborate why? I find it natural one should have the option to downvote anonymously & with no further explanation, but the statement still doesn't seem obvious to me.

For scaling to larger training systems, the trend is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but this probably won't hold up in practice as we get to 250T and then 750T tokens within a few years even for a dense model.

For 1:32 MoE at 5e28 FLOPs (5 GW ... (read more)

Chinchilla's 20 tokens/param (at 6e23 FLOPs) change significantly when working with different datasets, architectures, or amounts of compute. For Llama-3-405B, it's 37 tokens/param at 4e25 FLOPs and increasing 1.5x for every 1000x of compute (Figure 3). When training on data repeated 60 times, optimal tokens/param increase about 2.5x (Figure 3).
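A rough extrapolation of that trend, anchored only on the Llama 3 datapoint (37 tokens/param at 4e25 FLOPs, growing ~1.5x per 1000x of compute); the value it gives near Chinchilla's 6e23 FLOPs comes out closer to 29 than 20 because the dataset and architecture differ, which is the point:

```python
import math

def optimal_tokens_per_param(compute_flops: float) -> float:
    """Extrapolate the compute-optimal tokens/param ratio from the Llama 3 anchor."""
    decades = math.log10(compute_flops / 4e25)
    return 37 * 1.5 ** (decades / 3)

for c in (6e23, 4e25, 5e28):
    ratio = optimal_tokens_per_param(c)
    params = (c / (6 * ratio)) ** 0.5      # from compute ~ 6 * params * tokens
    tokens = ratio * params
    print(f"{c:.0e} FLOPs: ~{ratio:.0f} tokens/param, "
          f"~{params/1e9:,.0f}B params, ~{tokens/1e12:,.0f}T tokens")
```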

For MoE models with 87% (1:8) sparsity, optimal tokens/param increase 3x, and at 97% (1:32) sparsity by 6x (Figure 12, left). This suggests that if Llama-3-405B was instead a MoE model with 97% sparsity, it would ha... (read more)

1harsimony
Wonderful to get more numbers on this!  These examples seem to contradict note 2 where D/N falls for larger C. Now I'm not sure what the trend should be. It feels like you could derive a rule of thumb based on the loss and the entropy of the dataset e.g. "If my model starts at a loss of 4 bits/token and the asymptote is 2 bits/token, I need X tokens of data to fully specify a model with Y bits stored in the parameters."

With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.

Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.

With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum p... (read more)

2ryan_greenblatt
I'm currently skeptical and more minimally, I don't understand the argument you're making. Probably not worth getting into. I do think there will be a limit to how sparse you want to go even in the very high compute relative to data regime for various reasons (computational if nothing else). I don't see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument. Regardless, I don't think this argues against my claim, not sure if you were trying to argue against the claim I was saying or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)

10.5-13% on text only part of HLE

This is for o3-mini, while the ~25% figure for o3 from the tweet you linked is simply restating deep research evals.

And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.

A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.
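A sketch of why the data requirement stays the same (ignoring routing and attention overheads): if a MoE reaches the dense model's loss with k times less compute on the same number of tokens, its active parameter count must be k times smaller, so its optimal tokens/param ratio is k times higher, matching the 3x-6x figures below.

```python
def moe_equivalent(dense_active_params, dense_tokens, compute_multiplier):
    """Describe the MoE that matches a compute-optimal dense model's loss with
    `compute_multiplier`x less compute but the same data (a rough sketch)."""
    dense_compute = 6 * dense_active_params * dense_tokens
    moe_compute = dense_compute / compute_multiplier
    moe_tokens = dense_tokens                     # same data requirement
    moe_active = moe_compute / (6 * moe_tokens)   # active params shrink by the multiplier
    return moe_active, moe_tokens / moe_active

# Example: a 100B-param dense model at 20 tokens/param vs. a MoE with a 4x multiplier.
active, tokens_per_param = moe_equivalent(100e9, 2e12, 4.0)
print(active / 1e9, "B active params,", tokens_per_param, "tokens/param")  # 25B, 80
```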

A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute mult... (read more)

5ryan_greenblatt
I agree compute optimal MoEs don't improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data. As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute. Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)

didn't run red-teaming and persuasion evals on the actually-final-version

Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.

they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks."

Ah, I failed to take a note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from saturation of the benchmark, and in the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn't even any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: accuracy for pass@1 is 0.45 after 2K steps, 0.55 (+0.10) a... (read more)

DeepSeek-R1 ... Run RL to convergence

Not to convergence, the graphs in the paper keep going up. Which across the analogy might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.

o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023)

It seems like o1-mini is its own thing, might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So ... (read more)

3MiloSal
Thanks for your comments! On page 10, when describing the training process for R1, they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks." I refer to this.  I basically agree with your analysis of GPT-5--which is worrying for short-term scaling, as I tried to argue.

The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off.

Still, at least as long as base model effective training compute isn't scaled another 1,000x (which is 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM) rewards, which for now don't let RL scale as much as with explicitly coded verifiers.

This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
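A minimal sketch of what an explicitly coded verifier means in this kind of RL setup (a hypothetical toy example, not any lab's actual grader): the reward is computed by a program checking the final answer against known ground truth, which is easy to write for olympiad-style math/coding and hard to write for open-ended tasks.

```python
import re

def math_verifier(problem: dict, model_output: str) -> float:
    """Reward 1.0 if the final boxed answer matches the known ground truth, else 0.0.
    No learned reward model is involved anywhere."""
    match = re.search(r"\\boxed\{(.+?)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == problem["answer"] else 0.0

# Usage: score sampled completions, then feed rewards into the RL update (e.g. GRPO/PPO).
problem = {"question": "What is 12 * 7?", "answer": "84"}
print(math_verifier(problem, r"... so the result is \boxed{84}"))  # 1.0
```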

Relative to GPT-4o, which was trained at a time when 30K H100 clusters were around, and so in BF16 could be expected to be around 8e25 FLOPs, possibly overtrained to a degree that's not too different from DeepSeek-V3 itself.
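A rough version of that estimate; every input is an assumption (about 30K H100s, ~1e15 dense BF16 FLOP/s per GPU, ~40% utilization, ~3 months of training):

```python
gpus = 30_000
bf16_flops_per_gpu = 1e15      # ~1 petaFLOP/s dense BF16 per H100, approximately
utilization = 0.4              # assumed model FLOPs utilization
seconds = 90 * 24 * 3600       # ~3 months

total_flops = gpus * bf16_flops_per_gpu * utilization * seconds
print(f"{total_flops:.1e} FLOPs")   # ~9e25, same ballpark as the 8e25 above
```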

Amodei's post you linked says a few of tens of millions of dollars for Claude 3.5 Sonnet, which is maybe 4e25 FLOPs in BF16, but I think Claude 3.5 Sonnet is better than DeepSeek-V3, which is not as clearly the case for GPT-4o and DeepSeek-V3, making them easier to compare. Being better than GPT-4o at 2x fewer FLOPs, Claude 3.5 Sonnet ... (read more)

Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.

The $80bn Microsoft capex is not relevant to this if i... (read more)

From what I remember, the training-compute optimal number of experts was like 64

I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.

Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so "number of experts" doesn't really capture the ratio of total to activated, probably not a good anchor by itself.

Given ne

... (read more)

The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.

But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivate... (read more)

Taken in isolation, DeepSeek-V3 looks like a 15x compute multiplier. But if a lot of it is data, the multiplier won't scale (when you need much more data, it necessarily becomes worse, or alternatively you need a teacher model that's already better). In any case, this raises the ceiling for what 5 GW training systems can do (at which point there's either almost-AGI or scaling slows down a lot). And there the 15x multiplier of DeepSeek-V3 (or what remains of it after scaling) needs to be compared with the algorithmic advancements of 2025-2028, which would've included most of the things in DeepSeek-V3 anyway, so the counterfactual impact is small.

4ryan_greenblatt
15x compute multiplier relative to what? See also here.

32B active parameters instead of likely ~220B for GPT4

It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters?

32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost

Doesn't follow: training cost also scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.

The training costs are maybe 5e24 FLOPs and 2e2... (read more)

3Maxime Riché
Thanks for your corrections, that's welcome.

Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by "training cost", I mostly had in mind "training cost per token":

* 32B active parameters instead of likely ~ ~~220~~ 280B for GPT4 => ~~6.8~~ 8.7x lower training cost per token.

From what I remember, the training-compute optimal number of experts was like 64, given implementations a few years old (I don't remember how many activated at the same time in this old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.

Right, that's why I wrote: "possibly 4x fewer training steps for the same number of tokens if predicting tokens only once" (assuming predicting 4 tokens at a time), but that's not demonstrated nor published (given my limited knowledge on this).

training on O1 outputs

Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.

Imitation helps with post-training, but the compute-heavy part is pretraining, and obtaining good quality with little pretraining is a novel feat that isn't known to be explainable by good post-training, or by including a lot of outputs from good models in the pretraining/annealing mix.

1Hastings
I think most of the imitation happens in the pretraining. I don't know about o1, but DeepSeek v3 is at minimum trained on a ton of 4o outputs, although they are slightly cagey about this. Just the first thing I tried, I had ChatGPT write a sock poem [4o's poem omitted here]. Then I gave v3 just the first two stanzas, and asked it to continue the poem [v3's continuation omitted here]. The shot in the dark guess of the "humble socks, so oft unseen... routine" couplet is a fucking soul read. v3 knows 4o, in a way that I kind of hope no model ever knows a person.
8gwern
It would be more precise to say outputs of o1 aren't supposed to include the reasoning traces. But in addition to the reasoning traces OA voluntarily released, people have been observing what seem to be leaks, and given that the history of LLM robustness to jailbreaks can be summarized as 'nil', it is at least conceivable that someone used a jailbreak+API to exfiltrate a bunch of traces. (Remember that Chinese companies like ByteDance have definitely been willfully abusing the OA API for the purposes of knowledge distillation/cloning and evading bans etc, in addition to a history of extremely cutthroat tactics that FANG would blanch at, so it's a priori entirely plausible that they would do such things.) I don't believe DeepSeek has done so, but it is technically possible. (Regardless of whether anyone has done so, it is now partially moot given that r1 traces in the DS paper, and based on third party reports thus far, work so well for distillation so everyone can kickstart their own r1-clone with r1 reasoning traces and work from there. There may be more reason to try to exfiltrate o3+ traces, but OA may also decide to not bother, as users are claiming to value and/or enjoy reading the raw traces, and since the secret & capability is out, maybe there's not much point in hiding them any longer.)

This seems unlikely to be a neglected concern, unless there are specific signs that it is.

could end up being the most important thing I’ve ever written

The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.

The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of R1-Lite-Preview release:

Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren't doing it with that many fewer GPUs

DeepSeek explicitly states that

DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.

This seems unlikely to be ... (read more)

1Knight Lee
I see, thank you for the info! I don't actually know about DeepSeek V3, I just felt "if I pointed out the $6 million claim in my argument, I shouldn't hide the fact I watched a video which made myself doubt it." I wanted to include the video as a caveat just in case the $6 million was wrong. Your explanation suggests the $6 million is still in the ballpark (for the final training run), so the concerns about a "software only singularity" are still very realistic.

Found the following in the Jan 23 newsletter:

AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.

What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that

I would not be surprised if in 2026 we have more than a million of some kind of chip.

Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
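A quick consistency check on the Abilene figures, using an assumed ~2 kW all-in power per B200 (chip plus cooling and networking overhead) and the ~2.5x H100-equivalence implied by the numbers above:

```python
kw_per_b200 = 2.0             # assumption: all-in power per B200, including overhead
h100_equiv_per_b200 = 2.5     # implied by 200K-300K B200s ~ 500K-750K H100s of FLOP/s

for mw in (400, 600):
    b200s = mw * 1_000 / kw_per_b200          # MW -> kW -> chip count
    print(f"{mw} MW: ~{b200s/1e3:.0f}K B200s, "
          f"~{b200s*h100_equiv_per_b200/1e3:.0f}K H100-equivalents")
```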

Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.

OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pr... (read more)

2Lorenzo
For context, average US electricity consumption in 2022 was ~500GW. So these would be ~1% of all US electricity consumption (as an order of magnitude)
5anaguma
How do Anthropic's and xAI's compute compare over this period?

What can be done for $6 million, can be done even better with 6 million GPUs[1]. What can be done with 6 million GPUs, can't be done for $6 million. Giant training systems are the moat.


  1. H/t Gwern. ↩︎

1rahulxyz
Yeah, in one sense that makes sense. But also, NVDA is down ~16% today.

By "3rd person perspective" I mean considering the world itself, there is no actual third person needed for it. It's the same framing as used by a physicist when talking about the early stages of the universe when humans were not yet around, or when talking about a universe with alternative laws of physics, or when talking about a small system that doesn't include any humans as its part. Or when a mathematician talks about a curve on a plane.

Knowing absolutely everything is not necessary to know the relevant things, and in this case we know all the people ... (read more)

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1

Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.

GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million

Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT... (read more)

all copies ... will claim to be the original ... regardless of whether they are the original

Not if they endorse Litany of Tarski and understand the thought experiment!

Any "perceive yourself to X" phenomenon is something that happens within cognition of some abstract agent/person instance, whether they exist in some world or not. What kind of person instance is "perceiving themselves to black out" (that is, having blacked out)? Ghosts and afterlife seem more grounded than that. But for Earth/Mars question, both options are quite clear, and there is a you that perceives either of them in some of the possibilities, we can point to where those that perceive each of them are, and that is what would be correct for those instances to conclude about themselves, that they exist in the situations that contain them, known from the statement of the thought experiment.

1green_leaf
It's not a person instance, it's an event that happens to the person's stream of consciousness. Either the stream of consciousness truly, objectively ends, and a same-pattern copy will appear on Mars, mistakenly believing they're the very same stream-of-consciousness as that of the original person. Or the stream is truly, objectively preserved, and the person can calmly enter, knowing that their consciousness will continue on Mars. I don't think a 3rd-person analysis answers this question. (With the correct answer being, of course, that the stream is truly, objectively preserved.) Since I don't think a 3rd person analysis answers the original problem, I also don't think it answers it in case we massively complicate it like the OP has. (Edited for clarity.)

A 3rd person perspective is there anyway, can be used regardless, even if other perspectives are also applicable. In this case it explains everything already, so we can't learn additional things in other ways.

2avturchin
The 3rd person perspective assumes the existence (or at least possibility) of some observer X who knows everything and can observe how events evolve across all branches. However, this idea assumes that this observer X will be singular and unique, will continue to exist as one entity, and will linearly collect information about unfolding events. These assumptions clearly relate to ideas of personal identity and copying: it is assumed that X exists continuously in time and cannot be copied. Otherwise, there would be several 3rd person perspectives with different observations. This concept can be better understood through real physical experiments: an experiment can only be performed if the experimenter exists continuously and is not replaced by another experimenter midway through. 
3green_leaf
Does the 3rd person perspective explain if you survive a teleporter, or if you perceive yourself to black out forever (like after a car accident)?

There is a full explanation right there, in the description of the thought experiment. It describes all outcomes, including all observations and theoretical conclusions made by all the people-instances. We can look at this and ask whether those theoretical conclusions are correct, whether the theories the people-instances use to arrive at them are valid. You can tell what all the details of outcomes are in advance of actually doing this.

Personal experience of people existing in the world is mediated by the physical states of their brains (or other physica... (read more)

2avturchin
  Interesting. Can you elaborate?

One you in the worlds with total weight of 0.001 will observe remaining on Earth, while either the exact or approximate you in the worlds with total weight of 1.000 will observe arriving on Mars. That is all that actually happens.

Then they'll start making strange proclamations about their newfound epistemic states and empirical observations from the personal observation stream relevant to theories of identity, but that's beside the point.

1green_leaf
That only seems to make sense if the next instant of subjective experience is undefined in these situations (and so we have to default to a 3rd person perspective).
5avturchin
Your comment can be interpreted as a statement that theories of identity are meaningless. If they are meaningless, then copy=original view prevails. From the third-person point of view, there is no difference between copy and original. In that case, there is no need to perform the experiment. 

Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.

This could also work for general intelligence and not only narrow math/coding olympiad sort of problems. The potential of o1/R1 is plausibly constrained for now by ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities, there aren't clear signs so far that this is happening.

But this is a constraint on h... (read more)

1otto.barten
Maybe we can regulate data generation?

This was my understanding pre r1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.

But something is up with r1. It is unusually good at creative writing. It doesn't seem spikey in the way that I predicted.

I notice I am confused.

Possible explanation: r1 seems to have less restrictive 'guardrails' added using post-training. Perhaps this 'light hand at the tiller' results in not post-training it towards mode-collapse. It's closer to a raw base model than the o1 models.

This is just a hypothesis. There are many unknowns to be investigated.

it took people about 8 months to accelerate Andrej Karpathy's PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2

The baseline is weak; the 8 months is just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).

It's much harder to improve on GPT-4 o... (read more)

There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms.

I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).

What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms.

So in the same way that Newcomblike problems are the norm, so is the "unfair" interaction with decision-making algorithms. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".

2quetzal_rainbow
More technical definition of "fairness" here is that environment doesn't distinguish between algorithms with same policies, i.e. mappings <prior, observation_history> -> action? I think it captures difference between CooperateBot and FairBot. As I understand, "fairness" was invented as a response to the statement that it's rational to two-box and Omega just rewards irrationality.

Training frontier models needs a lot of chips, situations where "a chip notices something" (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can't be made into part of the datacenter).

AI accelerator chips have 80B+ transistors, much mor... (read more)

2jamesian
My guess is that AI accelerators will have some difficult-to-modify persistent memory based on similar chips having it, but I'm not sure if it would be on the same die or not. I wrote more about how a firmware-based implementation of Offline Licensing might use H100 secure memory, clocks, and secure boot here: https://arxiv.org/abs/2404.18308 

Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can't be circumvented. I'm guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can't be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
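A sketch of the anti-replay idea (a hypothetical scheme, not any real chip's firmware): the chip keeps a persistent counter that only moves forward, and a certificate is accepted only if it was issued for a counter value at least as high as the chip's current one, so old certificates stop working once the counter advances.

```python
class Chip:
    """Hypothetical sketch of replay protection via a monotonic counter.
    Assumes the chip can verify signatures and persist a single integer."""

    def __init__(self, verify_signature):
        self.counter = 0                    # persistent, must not be resettable
        self.verify_signature = verify_signature

    def accept(self, cert: dict) -> bool:
        # cert = {"min_counter": int, "payload": ..., "signature": ...}
        if not self.verify_signature(cert):
            return False                    # not issued by the licensing authority
        if cert["min_counter"] < self.counter:
            return False                    # stale certificate: replay attempt
        self.counter = cert["min_counter"] + 1   # advance, so this cert can't be reused
        return True
```

If an attacker can reset the stored counter (for example by cutting power to whatever holds it), replay works again, which is why the persistence has to be tamper-resistant.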

1Yonatan Cale
Thanks! Could you say more about your confidence in this?   Yes, specifically I don't want an attacker to reliably be able to reset it to whatever value it had when it sent the last challenge. If the attacker can only reset this memory to 0 (for example, by unplugging it) - then the chip can notice that's suspicious. Another option is a reliable wall clock (though this seems less promising). I think @jamesian told me about a reliable clock (in the sense of the clock signal used by chips, not a wall clock), I'll ask

You don't survive for anthropic reasons. Anthropic reasons explain the situations where you happen to survive by blind luck.

1Embee
Can you tell me your p(doom) and AGI timeline? Cause I think we can theoretically settle this: I give you x$ now and in y years you give me x times r $ back. Please tell me acceptable y, r for you (ofc in the sense of least-convenient-but-still-profitable)
1Embee
Feels deep but I don't get it. Would you mind elaborating?

for example Zvi insisting that anyone who is not using LLMs to 10x their productivity is not serious ... a vibe not a direct quote

I expect he'd disagree, for example I vaguely recall him mentioning that LLMs are not useful in a productivity-changing way for his own work. And 10x specifically seems clearly too high for most things even where LLMs are very useful, other bottlenecks will dominate before that happens.

1Cole Wyeth
10x was probably too strong but his posts are very clear he thinks it's a large productivity multiplier. I'll try to remember to link the next instance I see.

IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
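A quick check of that arithmetic: start from a compute-optimal model, cut active parameters 3x and raise data 10x, and look at how the tokens/param ratio and compute change.

```python
n, d = 100e9, 2e12              # an example compute-optimal point: 100B params, 20 tokens/param
n2, d2 = n / 3, d * 10          # 3x fewer active params, 10x more data

ratio_increase = (d2 / n2) / (d / n)             # 30x higher tokens/param: "30x overtrained"
compute_increase = (6 * n2 * d2) / (6 * n * d)   # ~3.3x more compute
print(ratio_increase, compute_increase)
```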

GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately co... (read more)
