I'm going to go against the flow here and not be easily impressed. I suppose it might just be copium.
Any actual reason to expect that the new model beating these challenging benchmarks, which have previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that have previously remained unconquered?
Don't get me wrong, I'm sure it's amazingly more capable in the domains in which it's amazingly more capable. But I see quite a lot of "AGI achieved" panicking/exhilaration in various discussions, and I wonder whether it's more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?
One of the reasons I'm skeptical is the part where it requires thousands of dollars' worth of inference-time compute. Implies it's doing brute force at extreme scale, which is a strategy that'd only work for, again, domains of well-posed problems with easily verifiable solutions. Similar to how o1 bl...
To first order, I believe a lot of the reason why the "AGI achieved" shrill posting often tends to be overhyped is that not because the models are theoretically incapable, but rather that reliability was way more of a requirement for it to replace jobs fast than people realized, and there are only a very few jobs where an AI agent can do well without instantly breaking down because it can't error-correct/be reliable, and I think this has been continually underestimated by AI bulls.
Indeed, one of my broader updates is that a capability is only important to the broader economy if it's very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is a bottleneck way more than people thought:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv
Not to say it's a nothingburger, of course. But I'm not feeling the AGI here.
These math and coding benchmarks are so narrow that I'm not sure how anybody could treat them as saying anything about "AGI". LLMs haven't even tried to be actually general.
How close is "the model" to passing the Woz test (go into a strange house, locate the kitchen, and make a cup of coffee, implicitly without damaging or disrupting things)? If you don't think the kinesthetic parts of robotics count as part of "intelligence" (and why not?), then could it interactively direct a dumb but dextrous robot to do that?
Can it design a nontrivial, useful physical mechanism that does a novel task effectively and can be built efficiently? Produce usable, physically accurate drawings of it? Actually make it, or at least provide a good enough design that it can have it made? Diagnose problems with it? Improve the design based on observing how the actual device works?
Can it look at somebody else's mechanical design and form a reasonably reliable opinion about whether it'll work?
Even in the coding domain, can it build and deploy an entire software stack offering a meaningful service on a real server without assistanc...
It’s not AGI, but for human labor to retain any long-term value, there has to be an impenetrable wall that AI research hits, and this result rules out a small but nonzero number of locations that wall might’ve been.
It's not really dangerous real AGI yet. But it will be soon this is a version that's like a human with severe brain damage to the frontal lobes that provide agency and self-management, and the temporal lobe, which does episodic memory and therefore continuous, self-directed learning.
Those things are relatively easy to add, since it's smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements - some low-hanging fruit are glaringly obvious from a computational neuroscience perspective, so I expect them to be employed almost as soon as a competent team starts working on episodic memory.
Don't indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.
Gambling that the gaps in LLMs abilities (relative to humans) won't be filled soon is a bad gamble.
It's hard to find numbers. Here's what I've been able to gather (please let me know if you find better numbers than these!). I'm mostly focusing on FrontierMath.
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem
I don't think anchoring to o3's current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in-principle, buying this capability is probably going to get orders of magnitude cheaper within the next five minutes months, as they find various algorithmic shortcuts.
I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e. g. the $900k they'd save by running the benchmarks half a year later.
Edit: In fact, if, against these expectations, the implementation of o3's trick can't be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per a FrontierMath problem and you can't get more efficient than one try per try), that would make me do a massive update against the "inference-time compute" paradigm.
OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.
An OpenAI staff member replied
Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.
and further confirmed that "tuned" in the graph is
a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.
Another OpenAI staff member said
...also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the bro
“Scaling is over” was sort of the last hope I had for avoiding the “no one is employable, everyone starves” apocalypse. From that frame, the announcement video from openai is offputtingly cheerful.
I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love.
I'm flabbergasted by this degree/kind of altruism. I respect you for it, but I literally cannot bring myself to care about "humanity"'s survival if it means the permanent impoverishment, enslavement or starvation of everybody I love. That future is simply not much better on my lights than everyone including the gpu-controllers meeting a similar fate. In fact I think my instincts are to hate that outcome more, because it's unjust.
But how do LW futurists not expect catastrophic job loss that destroys the global economy?
Slight correction: catastrophic job loss would destroy the ability of the non-landed, working public to paritcipate in and extract value from the global economy. The global economy itself would be fine. I agree this is a natural conclusion; I guess people were hoping to get 10 or 15 more years out of their natural gifts.
Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.
The "alignment" technique, "deliberative alignment", is much better than constitutional AI. It's the same during training, but it also teaches the model the safety criteria, and teaches the model to reason about them at runtime, using a reward model that compares the output to their specific safety criteria. (This also suggests something else I've been expecting - the CoT training technique behind o1 doesn't need perfectly verifiable answers in coding and math, it can use a decent guess as to the better answer in what's probably the same procedure).
While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I've been working on an update to my Internal independent review for language model agent alignment, and have been thinking about how this type of review could be trained instead of scripted into an agent as I'd originally predicted.
This is that technique. It does have some promise.
But I don't think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffold...
Oh dear, RL for everything, because surely nobody's been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.
I was still hoping for a sort of normal life. At least for a decade or maybe more. But that just doesn't seem possible anymore. This is a rough night.
My probably contrarian take is that I don't think improvement on a benchmark of math problems is particularly scary or relevant. It's not nothing -- I'd prefer if it didn't improve at all -- but it only makes me slightly more worried.
About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I'd freak out if/when LLMs solve them. They're still 1/10 and nothing has changed in the past year, and I doubt o3 will do better. (But I'm not making them public.)
Will write a reply to this comment when I can test it.
Regarding whether this is a new base model, we have the following evidence:
o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model the resulting model is very, very impressive. (2/n)
The prices leaked by ARC-ARG people indicate $60/million output tokens, which is also the current o1 pricing. 33m total tokens and a cost of $2,012.
Notably, the codeforces graph with pricing puts o3 about 3x higher than o1 (tho maybe it's a secretly log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That's collaborated by ARC-AGI, average 55k tokens per solution, which seems like a ton.
I think this evidence indicates this is likely the same b...
Using $4K per task means a lot of inference in parallel, which wasn't in o1. So that's one possible source of improvement, maybe it's running MCTS instead of individual long traces (including on low settings at $20 per task). And it might be built on the 100K H100s base model.
The scary less plausible option is that RL training scales, so it's mostly o1 trained with more compute, and $4K per task is more of an inefficient premium option on top rather than a higher setting on o3's source of power.
Performance at $20 per task is already much better than for o1, so it can't be just best-of-n, you'd need more attempts to get that much better even if there is a very good verifier that notices a correct solution (at $4K per task that's plausible, but not at $20 per task). There are various clever beam search options that don't need to make inference much more expensive, but in principle might be able to give a boost at low expense (compared to not using them at all).
There's still no word on the 100K H100s model, so that's another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3 based on a yet-unannounced GPT-4.5o base model that's better than Claude 3.5 Sonnet might be sufficient to explain the improvement. Without any public 100K H100s Chinchilla optimal models it's hard to say how much that alone should help.
I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it's around 8% (rather than the light blue at 25.2%). Which like, I dunno, maybe this is nit picking, but 25% on FrontierMath seems like a BIG deal, and I'd like to know how much to be updating my beliefs.
From an apparent author on reddit:
[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style porblems, 25% T3 = early researcher problems
The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions.
As I say here https://x.com/boazbaraktcs/status/1870369979369128314
Constitutional AI is a great work but Deliberative Alignment is fundamentally different. The difference is basically system 1 vs system 2. In RLAIF ultimately the generative model that answers user prompt is trained with (prompt, good response, bad response). Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and most importantly how to reason about this text in the context of a particular example.
T...
While I'm not surprised by the pessimism here, I am surprised at how much of it is focused on personal job loss. I thought there would be more existential dread.
I just straight up don't believe the codeforces rating. I guess only a small subset of people solve algorithmic problems for fun in their free time, so it's probably opaque to many here, but a rating of 2727 (the one in the table) would be what's called an international grandmaster and is the 176th best rating among all actively competing users on the site. I hope they will soon release details about how they got that performance measure..
First post, feel free to meta-critique my rigor on this post as I am not sure what is mandatory, expected, or superfluous for a comment under a post. Studying computer science but have no degree, yet. Can pull the specific citation if necessary, but...
these benchmarks don't feel genuine.
Chollet indicated in his piece:
...Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score ov
That's some significant progress, but I don't think will lead to TAI.
However there is a realistic best case scenario where LLM/Transformer stop just before and can give useful lessons and capabilities.
I would really like to see such an LLM system get as good as a top human team at security, so it could then be used to inspect and hopefully fix masses of security vulnerabilities. Note that could give a false sense of security, unknown unknown type situation where it would't find a totally new type of attack, say a combined SW/HW attack like Rowhammer/Meltdown but more creative. A superintelligence not based on LLM could however.
OpenAI didn't say what the light blue bar is
Presumably light blue is o3 high, and dark blue is o3 low?
For people who don't expect a strong government response... remember that Elon is First Buddy now. 🎢
Beating benchmarks, even very difficult ones, is all find and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-take with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. T...
See livestream, site, OpenAI thread, Nat McAleese thread.
OpenAI announced (but isn't yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2's trademark). "We plan to deploy these models early next year." "o3 is powered by further scaling up RL beyond o1"; I don't know whether it's a new base model.
o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn't say what the light blue bar is, but I think it doesn't really matter and the 25% is for real.[2])
o3 also is easily SoTA on SWE-bench Verified and Codeforces.
It's also easily SoTA on ARC-AGI,
after doing RL on the public ARC-AGI problems[3] + when spending $4,000 per task on inference (!).[4]OpenAI has a "new alignment strategy." (Just about the "modern LLMs still comply with malicious prompts, overrefuse benign queries, and fall victim to jailbreak attacks" problem.) It looks like RLAIF/Constitutional AI. See Lawrence Chan's thread.
OpenAI says "We're offering safety and security researchers early access to our next frontier models"; yay.
o3-mini will be able to use a low, medium, or high amount of inference compute, depending on the task and the user's preferences. o3-mini (medium) outperforms o1 (at least on Codeforces and the 2024 AIME) with less inference cost.
GPQA Diamond:
Update: most of them are not as hard as I thought:
My guess is it's consensus@128 or something (i.e. write 128 answers and submit the most common one). Even if it's pass@n (i.e. submit n tries) rather than consensus@n, that's likely reasonable because I heard FrontierMath is designed to have easier-to-verify numerical-ish answers.
Update: it's not pass@n.
Correction: no RL! See comment.
It's not clear how they can leverage so much inference compute; they must be doing more than consensus@n. See Vladimir_Nesov's comment.