Not for math benchmarks. Here's one way a model could "cheat" at them: suppose the CoT involves the model generating candidate proofs/derivations, running an internal (learned, not hard-coded) proof verifier on them, and then either rejecting the candidate and generating a new one, or outputting it. We know this is possible, since proof verifiers can be compactly specified.
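To make the shape of that concrete, here's a minimal sketch of the generate-and-verify loop I have in mind. The generator and verifier below are hypothetical stand-ins (a random string and a biased coin flip), not anything an actual model is known to implement:

```python
import random

def generate_candidate_proof(problem: str) -> str:
    """Stand-in for the model sampling a candidate proof/derivation within its CoT."""
    return f"candidate derivation #{random.randint(0, 10**6)} for: {problem}"

def verify(candidate: str) -> bool:
    """Stand-in for a learned, internal proof verifier (here just a biased coin flip)."""
    return random.random() < 0.01

def solve_by_rejection(problem: str, max_attempts: int = 100_000) -> str | None:
    """Keep generating candidates and rejecting them until one passes the verifier."""
    for _ in range(max_attempts):
        candidate = generate_candidate_proof(problem)
        if verify(candidate):
            return candidate  # output the first candidate that passes verification
    return None  # token budget exhausted

print(solve_by_rejection("prove that sqrt(2) is irrational"))
```

A loop like this can productively soak up an arbitrarily large token budget while remaining entirely mechanical.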
This wouldn't actually show "agency" or strategic thinking of the kind that might generalize to open-ended domains and "true" long-horizon tasks. In particular, it would mostly fail condition (2) from my previous comment.
Something more open-ended, requiring "research taste", would be needed. Maybe comparable performance on METR's benchmark would work for this (i.e., the model can beat a significantly larger fraction of it at 1 billion tokens than at 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn't follow the above "cheating" approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
Prooobably ~simultaneously, but I can maybe see it coming earlier, and in a way that isn't wholly convincing to me. In particular, it would still be a fixed-length task; much longer than what contemporary models can reliably manage today, but still hackable using poorly-generalizing "agency templates" instead of fully general "compact generators of agenty behavior" (which I speculate humans have and RL'd LLMs don't). It would be some evidence in favor of "AI can accelerate AI R&D", but not necessarily "LLMs trained via SSL+RL are AGI-complete".
Actually, I can also see it coming later. For example, suppose capability researchers invent some method for reliably and indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of the CoT. Fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).
I think the second scenario is more plausible, actually.
I wish we had something to bet on better than "inventing a new field of science,"
I've thought of one potential observable that is concrete, should be achievable at relatively low capability levels, and should provoke a strong update towards your model for me:
If there is an AI model such that the complexity of R&D problems it can solve (1) scales essentially boundlessly with the amount of serial compute provided to it (or to a "research fleet" based on it) and (2) scales much faster with serial compute than with parallel compute, while (3) the required amount of human attention ("babysitting") stays constant or grows very slowly with the amount of serial compute.
This attempts to directly get at the "autonomous self-correction" and "ability to think about R&D problems strategically" ideas.
I've not fully thought through all the possible ways reality could Goodhart this benchmark, i.e., "technically" pass it in a way I'd find unconvincing. For example, if I hadn't included condition (2), o3 would probably have already "passed" it (since it potentially achieved its better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs and then outputting the most frequent answer; see the sketch below). There might be other loopholes like this...
But it currently seems reasonable and True-Name-y to me.
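For concreteness, here's a minimal sketch of the majority-vote loophole mentioned above; `sample_cot_answer` is a hypothetical stand-in for sampling one CoT and reading off its final answer:

```python
import random
from collections import Counter

def sample_cot_answer(problem: str) -> str:
    """Stand-in for sampling one chain-of-thought and extracting its final answer.
    Here it returns the 'right' answer 40% of the time and scattered wrong ones otherwise."""
    return "42" if random.random() < 0.4 else str(random.randint(0, 999))

def majority_vote(problem: str, n_samples: int = 1000) -> str:
    """Sample many CoTs (embarrassingly parallel, cheap in *serial* time) and
    output whichever final answer occurs most often."""
    answers = [sample_cot_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("some hard benchmark problem"))  # almost always prints "42"
```

This is exactly the kind of parallel-compute trick condition (2) is meant to rule out: it converts more samples into better benchmark scores without any single CoT becoming more strategic.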
Well, he hasn't done it yet either, has he? His new announcement is, likewise, just that: an announcement. Manifold is still at 35% on him not following through on it, for example.
"Intensely competent and creative", basically, maybe with a side of "obsessed" (with whatever they're cracked at).
Supposedly Trump announced that back in October, so it should already be priced in.
(Here's my attempt at making sense of it, for what it's worth.)
Here's a potential interpretation of the market's apparently strange reaction to DeepSeek-R1 (shorting Nvidia).
I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.
If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don't need bigger data centers to get better base models, but that's not where most of the money was anyway.
And my understanding is that, on the inference-time scaling paradigm, there isn't yet any proven method of transforming arbitrary quantities of compute into better performance.
Suppose that you don't expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it'd converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren't going to work out the way the AGI labs promise.
In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.
In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.
The fact that capable models can be run locally might increase the number of people willing to use them (e.g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it's not obvious that this demand spike will be bigger than the simultaneous demand drop.
And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the "wire a bunch of Macs together" option than to "wire a bunch of GPUs together" (due to the compactness). If many people are like this, it would explain why Nvidia is down while Apple is (slightly) up. (Moreover, it's apparently possible to run the full 671B-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, that appears cheaper than mucking about with GPUs/Macs, at just $6,000.)
This world doesn't seem outright implausible to me. I'm bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it'll likely go the way of search-and-distill.
On balance, I don't actually expect the market to have any idea what's going on, so I don't know that its reasoning is this specific flavor of "well-informed but skeptical". And again, it's possible the drop was due to Trump, nothing to do with DeepSeek at all.
But as I'd said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.
I don't have deep expertise in the subject, but I'm inclined to concur with the people saying that the widely broadcast signals don't actually represent one consistent thing, despite your plausible argument to the contrary.
Here's a Scott Alexander post speculating about why that might be the case. In short: there was optimization pressure towards making internal biological signals very difficult to decode, because easily decodable signals were an easy target for parasites evolving to exploit them. As a result, the actual signals are probably represented as "unnecessarily" complicated, timing-based combinations of various "basic" chemical, electrical, etc. signals, and they're somewhat individualized to boot. You can't decode them just by looking at any one spatially isolated chunk of the body, by design.
Basically: separate chemical substances (and other components that look "simple" locally/from the outside) are not the privileged basis for decoding internal signals. They're the anti-privileged basis, if anything.
Hmm. This does have the feel of gesturing at something important, but I don't see it clearly yet...
Free association: geometric rationality.
MIRI's old results argue that "corrigibility via uncertainty regarding the utility function" doesn't work: if the agent maximizes expected utility anyway, it doesn't matter one whit whether we're taking the expectation over actions or over utility functions. However, corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next". Is there, therefore, some way to implement something like this while avoiding MIRI's counterexample?
Loophole: the counterexample works in the arithmetically-expected-utility regime. What if we instead do it in the geometric one? I.e., have the agent take actions that maximize the geometric expectation over its candidate utility functions? This is a more conservative/egalitarian regime: any one utility function going to zero (or flipping negative) wipes out all value, unlike with sums (which are more tolerant of ignoring/pessimizing some terms, and can have "utility monsters"). So it might actually make the agent hesitant to introduce potentially destructive changes to its environment...
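A toy numerical illustration of the difference, with made-up utility values of my own (not anything from the MIRI or geometric-rationality writeups):

```python
import math

def arithmetic_expectation(utilities, credences):
    """Standard expected utility over candidate utility functions: sum_i p_i * U_i(a)."""
    return sum(p * u for p, u in zip(credences, utilities))

def geometric_expectation(utilities, credences):
    """Geometric expectation: prod_i U_i(a)^{p_i}; assumes nonnegative utilities."""
    return math.prod(u ** p for p, u in zip(credences, utilities))

credences = [1/3, 1/3, 1/3]    # equal credence in three candidate utility functions
action_A = [10.0, 10.0, 0.0]   # great by two of them, catastrophic by the third
action_B = [3.0, 3.0, 3.0]     # modest by all three

print(arithmetic_expectation(action_A, credences))  # ~6.67: A looks best to an arithmetic maximizer
print(arithmetic_expectation(action_B, credences))  # 3.0
print(geometric_expectation(action_A, credences))   # 0.0: one zeroed-out utility wipes out all value
print(geometric_expectation(action_B, credences))   # ~3.0: the geometric maximizer prefers B
```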
(This is a very quick take and it potentially completely misunderstands the concepts involved. But I figure it's better to post than not, in case the connection turns out obvious to anyone else.)
Certainly (experimenting with r1's CoTs right now, in fact). I agree that they're not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system "technically" clears the bar you'd outlined, yet I end up unmoved (I don't want to end up goalpost-moving).
Though neither are they being "strategic" in the way I expect they'd need to be in order to productively use a billion-token CoT.
Yeah, I'm also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!