Thane Ruthenis

Not for math benchmarks. Here's one way it can "cheat" at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
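
To make the shape of that "cheating" loop concrete, here's a minimal runnable sketch, with a trivial stand-in task (finding an integer root of a polynomial) in place of proof generation; everything here is a made-up illustration of the loop's structure, not of anything the model literally runs:

```python
# Toy guess-and-verify loop: blind candidate generation plus a cheap verifier.
import random

def generate_candidate(rng):
    """Stand-in for 'sample a candidate proof/derivation'."""
    return rng.randint(-100, 100)

def verify(candidate):
    """Stand-in for the learned proof verifier: cheap to check, hard to guess."""
    return candidate ** 2 - 5 * candidate + 6 == 0  # accepts only the roots 2 and 3

def solve_by_guess_and_check(max_attempts=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(max_attempts):
        candidate = generate_candidate(rng)
        if verify(candidate):
            return candidate  # output the first candidate that passes verification
    return None               # give up once the attempt budget is exhausted

print(solve_by_guess_and_check())  # prints 2 or 3
```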

This wouldn't actually show "agency" and strategic thinking of the kinds that might generalize to open-ended domains and "true" long-horizon tasks. In particular, it would mostly fail condition (2) from my previous comment.

Something more open-ended and requiring "research taste" would be needed. Maybe comparable performance on METR's benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens than at 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.

Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn't follow the above "cheating" approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)

Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn't wholly convincing to me. In particular, it would still be a fixed-length task; much longer-length than what the contemporary models can reliably manage today, but still hackable using poorly-generalizing "agency templates" instead of fully general "compact generators of agenty behavior" (which I speculate humans to have and RL'd LLMs not to). It would be some evidence in favor of "AI can accelerate AI R&D", but not necessarily "LLMs trained via SSL+RL are AGI-complete".

Actually, I can also see it coming later. For example, suppose capability researchers invent some method for reliably and indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of the CoT. Fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could then appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).

I think the second scenario is more plausible, actually.

I wish we had something to bet on better than "inventing a new field of science,"

I've thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:

If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a "research fleet" based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention ("babysitting") is constant or grows very slowly with the amount of serial compute.

This attempts to directly get at the "autonomous self-correction" and "ability to think about R&D problems strategically" ideas.
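
To gesture at how I'd operationalize those three conditions, here's a rough sketch; the measurements, numbers, and thresholds below are entirely hypothetical placeholders, not a worked-out protocol:

```python
# Hypothetical check of the three conditions via log-log scaling slopes.
import math

def log_log_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx, ly = [math.log(x) for x in xs], [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)

# Made-up measurements at increasing compute budgets (arbitrary units):
budgets           = [1, 10, 100, 1000]
solved_serial     = [2, 10, 50, 250]   # hardest R&D problem solved when scaling serial compute
solved_parallel   = [2, 3, 4, 5]       # same, when scaling parallel compute instead
babysitting_hours = [5, 6, 6, 7]       # human attention required per run

cond1 = log_log_slope(budgets, solved_serial) > 0.5      # (1) keeps scaling with serial compute
cond2 = log_log_slope(budgets, solved_serial) > 3 * log_log_slope(budgets, solved_parallel)  # (2) serial >> parallel
cond3 = log_log_slope(budgets, babysitting_hours) < 0.1  # (3) required attention stays ~flat

print(cond1, cond2, cond3)  # all True for these made-up numbers
```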

I've not fully thought through all the possible ways reality could Goodhart this benchmark, i. e. "technically" pass it but in a way I find unconvincing. For example, if I had failed to include condition (2), o3 would probably have already "passed" it (since it potentially achieved its better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs and then outputting the most frequent answer). There might be other loopholes like this...

But it currently seems reasonable and True-Name-y to me.

Well, he hasn't done it yet either, has he? His new announcement is, likewise, just that: an announcement. Manifold is still at 35% on him not following through on it, for example.

"Intensely competent and creative", basically, maybe with a side of "obsessed" (with whatever they're cracked at).

Supposedly Trump announced that back in October, so it should already be priced in.

(Here's my attempt at making sense of it, for what it's worth.)

Here's a potential interpretation of the market's apparent strange reaction to DeepSeek-R1 (shorting Nvidia).

I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.

If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don't need bigger data centers to get better base models, but that's not where most of the money was anyway.

And my understanding is that, on the inference-time scaling paradigm, there isn't yet any proven method of transforming arbitrary quantities of compute into better performance:

  • Reasoning models are bottlenecked on the length of CoTs that they've been trained to productively make use of. They can't fully utilize even their context windows; the RL pipelines just aren't up to that task yet. And if that bottleneck were resolved, the context-window bottleneck would be next: my understanding is that infinite context/"long-term" memories haven't been properly solved either, and it's unknown how they'd interact with the RL stage (probably they'd interact okay, but maybe not).
    • o3 did manage to boost its ARC-AGI and (maybe?) FrontierMath performance by... generating a thousand guesses and then picking the most common one...? But who knows how that really worked, and how practically useful it is; a toy sketch of this kind of majority voting follows the list below. (See e. g. this, although that paper examines a somewhat different regime.)
  • Agents, from Devin to Operator to random open-source projects, are still pretty terrible. You can't set up an ecosystem of agents in a big data center and let them rip, such that the ecosystem's power scales boundlessly with the data center's size. For all but the most formulaic tasks, you still need a competent human closely babysitting everything they do, which means you're still mostly bottlenecked on competent human attention.
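
Here's the sketch promised above: a runnable toy version of the rumored "sample many CoTs and output the most common final answer" trick (a.k.a. self-consistency / majority voting). The noisy answer sampler is a made-up stand-in; nobody outside OpenAI knows what o3 actually did:

```python
# Majority voting over noisy samples: wrong answers scatter, the right one repeats.
import random
from collections import Counter

def sample_final_answer(rng, correct_answer=42):
    """Stand-in for one CoT rollout: right 40% of the time, random noise otherwise."""
    return correct_answer if rng.random() < 0.4 else rng.randint(0, 100)

def majority_vote(n_samples=1000, seed=0):
    rng = random.Random(seed)
    answers = [sample_final_answer(rng) for _ in range(n_samples)]
    best_answer, count = Counter(answers).most_common(1)[0]
    return best_answer, count

print(majority_vote())  # the correct answer wins even though each rollout is <50% accurate
```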

Suppose that you don't expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it'd converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren't going to work out the way the AGI labs promise.

In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.

In such a world, if it were shown that tasks can be completed using ten times less compute than previously expected, this would decrease the expected demand for compute.

The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it's not obvious that this demand spike will be bigger than the simultaneous demand drop.
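
To spell out the structure of that trade-off, a toy calculation with numbers that are made up purely for illustration:

```python
# Two opposing effects on total compute demand in the "fixed compute per task" world.
old_compute_per_task = 10.0   # arbitrary units
new_compute_per_task = 1.0    # "the same task now needs ten times less compute"

old_num_tasks = 1_000         # tasks people bothered to automate at the old price
usage_growth  = 5             # how much cheaper automation expands usage (made up)

old_demand = old_num_tasks * old_compute_per_task                  # 10,000 units
new_demand = old_num_tasks * usage_growth * new_compute_per_task   # 5,000 units

# Even a 5x usage spike leaves total demand lower; only if usage grows by more
# than the 10x efficiency factor does net demand for compute actually rise.
print(old_demand, new_demand)
```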

And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the "wire a bunch of Macs together" option, compared to "wire a bunch of GPUs together" (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it's apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)

This world doesn't seem outright implausible to me. I'm bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it'll likely go the way of search-and-distill.

On balance, I don't actually expect the market to have any idea what's going on, so I don't know that its reasoning is this specific flavor of "well-informed but skeptical". And again, it's possible the drop was due to Trump, nothing to do with DeepSeek at all.

But as I'd said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.

I don't have deep expertise in the subject, but I'm inclined to concur with the people saying that the widely broadcast signals don't actually represent one consistent thing, despite your plausible argument to the contrary.

Here's a Scott Alexander post speculating about why that might be the case. In short: there was optimization pressure towards making internal biological signals very difficult to decode, because easily decodable signals were easy targets for parasites evolving to exploit them. As a result, the actual signals are probably represented as "unnecessarily" complicated, timing-based combinations of various "basic" chemical, electrical, etc. signals, and they're somewhat individualized to boot. You can't decode them just by looking at any one spatially isolated chunk of the body, by design.

Basically: separate chemical substances (and other components that look "simple" locally/from the outside) are not the privileged basis for decoding internal signals. They're the anti-privileged basis, if anything.

Hmm. This does have the feel of gesturing at something important, but I don't see it clearly yet...

Free association: geometric rationality.

MIRI's old results argue that "corrigibility via uncertainty regarding the utility function" doesn't work, because if the agent maximizes expected utility anyway, it doesn't matter one whit whether we're taking expectation over actions or over utility functions. However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next". Is there, therefore, some way to implement something-like-this while avoiding MIRI's counterexample?
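
Concretely, the collapse argument, as I understand it, is just linearity of expectation: averaging over candidate utility functions is the same as maximizing one fixed mixture utility.

```latex
% Uncertainty over utility functions collapses into one mixture utility:
\[
  \arg\max_a \, \mathbb{E}_{i \sim p}\big[U_i(a)\big]
  \;=\; \arg\max_a \sum_i p_i \, U_i(a)
  \;=\; \arg\max_a \, \bar{U}(a),
  \qquad \text{where } \bar{U} := \sum_i p_i \, U_i .
\]
% So the agent behaves exactly like an ordinary maximizer of the fixed utility
% function \bar{U}, and no corrigibility is gained from the uncertainty.
```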

Loophole: the counterexample works in the arithmetically-expected-utility regime. What if we instead do it in the geometric one? I. e., have an agent take actions that maximize the geometric expectation over candidate utility functions (the credence-weighted product)? This is a more conservative/egalitarian regime: any one utility function going to zero or flipping negative wipes out all value, unlike with sums (which are more tolerant of ignoring/pessimizing some terms, and can have "utility monsters"). So it might potentially make the agent actually hesitant to introduce potentially destructive changes to its environment...
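
Here's a toy, runnable illustration of the difference; the utilities and credences are made up, and "geometric expectation" here just means the credence-weighted geometric mean:

```python
# Arithmetic vs. geometric aggregation over candidate utility functions.
import math

# Two candidate utility functions the agent is uncertain between, evaluated
# on two actions: "risky" is great under u1 but nearly worthless under u2;
# "cautious" is decent under both. All numbers are made up.
candidate_utilities = {
    "risky":    {"u1": 100.0, "u2": 0.01},
    "cautious": {"u1": 10.0,  "u2": 10.0},
}
credences = {"u1": 0.5, "u2": 0.5}

def arithmetic_score(action):
    """Standard expected utility: sum_i p_i * u_i(action)."""
    return sum(credences[u] * v for u, v in candidate_utilities[action].items())

def geometric_score(action):
    """Geometric expectation: prod_i u_i(action)^{p_i} = exp(E[log u])."""
    return math.exp(sum(credences[u] * math.log(v)
                        for u, v in candidate_utilities[action].items()))

for action in candidate_utilities:
    print(action, arithmetic_score(action), geometric_score(action))
# Arithmetic expectation prefers "risky" (~50 vs 10); the geometric one prefers
# "cautious" (10 vs ~1), since tanking any one candidate utility tanks the product.
```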

(This is a very quick take and it potentially completely misunderstands the concepts involved. But I figure it's better to post than not, in case the connection turns out obvious to anyone else.)

Coming back to this in the wake of DeepSeek r1...

I don't think the cumulative compute multiplier since GPT-4 is that high, I'm guessing 3x, except perhaps for DeepSeek-V3, which wasn't trained compute optimally and didn't use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment?

Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they're roughly at the same level. That should be very surprising. Investing a very different amount of money into V3's training should've resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!

Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above "dumb human", and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near "dumb human".

Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get... 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?
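
A quick Monte Carlo version of that toy model, with made-up ranges for the slopes and budgets, just to put a number on how unlikely the "landed in the same neighbourhood" outcome is:

```python
# Estimate how often two methods with different ROI, fed different compute
# budgets, land within 5% of each other's output. All ranges are made up.
import random

def coincidence_rate(trials=100_000, tolerance=0.05, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        A, B = rng.uniform(0.1, 10), rng.uniform(0.1, 10)  # the two methods' returns per unit compute
        x1, x2 = rng.uniform(1, 100), rng.uniform(1, 100)  # the two compute budgets
        f, g = A * x1, B * x2
        if abs(f - g) / max(f, g) < tolerance:             # "same neighbourhood"
            hits += 1
    return hits / trials

print(coincidence_rate())  # on the order of a few percent for these made-up ranges
```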

The explanations that come to mind are:

  • It actually is just that much of a freaky coincidence.
  • DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
  • DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
  • DeepSeek kept investing and tinkering until getting to GPT-4ish level, and then stopped immediately after attaining it.
  • GPT-4ish neighbourhood is where LLM pretraining plateaus, which is why this capability level acts as a sort of "attractor" into which all training runs, no matter how different, fall.