OpenAI, AI
Personal Blog
2024 Top Fifty: 12%

139

1 min read

See livestream, site, OpenAI thread, Nat McAleese thread.

OpenAI announced (but isn't yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2's trademark). "We plan to deploy these models early next year." "o3 is powered by further scaling up RL beyond o1"; I don't know whether it's a new base model.

o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn't say what the light blue bar is, but I think it doesn't really matter and the 25% is for real.[2])

o3 also is easily SoTA on SWE-bench Verified and Codeforces.

It's also easily SoTA on ARC-AGI, after doing RL on the public ARC-AGI problems[3] + when spending $4,000 per task on inference (!).[4]

OpenAI has a "new alignment strategy." (Just about the "modern LLMs still comply with malicious prompts, overrefuse benign queries, and fall victim to jailbreak attacks" problem.) It looks like RLAIF/Constitutional AI. See Lawrence Chan's thread.

OpenAI says "We're offering safety and security researchers early access to our next frontier models"; yay.

o3-mini will be able to use a low, medium, or high amount of inference compute, depending on the task and the user's preferences. o3-mini (medium) outperforms o1 (at least on Codeforces and the 2024 AIME) with less inference cost.

GPQA Diamond: (chart not reproduced)

  1. ^

    Update: most of them are not as hard as I thought:

    There are 3 tiers of difficulty within FrontierMath: 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style [problems], 25% T3 = early researcher problems.

  2. ^

    My guess is it's consensus@128 or something (i.e. write 128 answers and submit the most common one). Even if it's pass@n (i.e. submit n tries) rather than consensus@n, that's likely reasonable because I heard FrontierMath is designed to have easier-to-verify numerical-ish answers.

    Update: it's not pass@n.

  3. ^

    Correction: no RL! See comment.

  4. ^

    It's not clear how they can leverage so much inference compute; they must be doing more than consensus@n. See Vladimir_Nesov's comment.

107 comments

I'm going to go against the flow here and not be easily impressed. I suppose it might just be copium.

Any actual reason to expect that the new model beating these challenging benchmarks, which have previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that have previously remained unconquered?

Don't get me wrong, I'm sure it's amazingly more capable in the domains in which it's amazingly more capable. But I see quite a lot of "AGI achieved" panicking/exhilaration in various discussions, and I wonder whether it's more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?

One of the reasons I'm skeptical is the part where it requires thousands of dollars' worth of inference-time compute. Implies it's doing brute force at extreme scale, which is a strategy that'd only work for, again, domains of well-posed problems with easily verifiable solutions. Similar to how o1 bl... (read more)


To first order, I believe much of the reason the "AGI achieved" shrill posting tends to be overhyped is not that the models are theoretically incapable, but that reliability turned out to be far more of a requirement for replacing jobs quickly than people realized. There are only a very few jobs where an AI agent can do well without instantly breaking down because it can't error-correct/be reliable, and I think this has been continually underestimated by AI bulls.

Indeed, one of my broader updates is that a capability is only important to the broader economy if it's very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is a bottleneck way more than people thought:

https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz

https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv

2Thane Ruthenis
I agree that this seems like an important factor. See also this post making a similar point.
3Noosphere89
To be clear, I do expect AI to accelerate AI research, and AI research may be one of the few exceptions to this rule. But it's one of the reasons I have longer timelines nowadays than a lot of other people, why I expect AI's impact on the economy to be surprisingly discontinuous in practice, and a big reason I expect AI governance to see few laws passed until very near the end of the AI-as-complement era for most jobs other than AI research. The post you linked is pretty great, thanks for sharing.
jbash

Not to say it's a nothingburger, of course. But I'm not feeling the AGI here.

These math and coding benchmarks are so narrow that I'm not sure how anybody could treat them as saying anything about "AGI". LLMs haven't even tried to be actually general.

How close is "the model" to passing the Woz test (go into a strange house, locate the kitchen, and make a cup of coffee, implicitly without damaging or disrupting things)? If you don't think the kinesthetic parts of robotics count as part of "intelligence" (and why not?), then could it interactively direct a dumb but dextrous robot to do that?

Can it design a nontrivial, useful physical mechanism that does a novel task effectively and can be built efficiently? Produce usable, physically accurate drawings of it? Actually make it, or at least provide a good enough design that it can have it made? Diagnose problems with it? Improve the design based on observing how the actual device works?

Can it look at somebody else's mechanical design and form a reasonably reliable opinion about whether it'll work?

Even in the coding domain, can it build and deploy an entire software stack offering a meaningful service on a real server without assistanc... (read more)

It’s not AGI, but for human labor to retain any long-term value, there has to be an impenetrable wall that AI research hits, and this result rules out a small but nonzero number of locations that wall might’ve been.

It's not really dangerous real AGI yet. But it will be soon. This is a version that's like a human with severe damage to the frontal lobes, which provide agency and self-management, and to the temporal lobe, which handles episodic memory and therefore continuous, self-directed learning.

Those things are relatively easy to add, since it's smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements - some low-hanging fruit are glaringly obvious from a computational neuroscience perspective, so I expect them to be employed almost as soon as a competent team starts working on episodic memory.

Don't indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.

 

Gambling that the gaps in LLMs' abilities (relative to humans) won't be filled soon is a bad gamble.

5Foyle
A very large amount of human problem solving/innovation in challenging areas is creating and evaluating potential solutions; it is a stochastic rather than deterministic process. My understanding is that our brains are highly parallelized in evaluating ideas across thousands of 'cortical columns' a few mm across (Jeff Hawkins' Thousand Brains formulation), with an attention mechanism that promotes the filtered best outputs of those myriad processes to form our 'consciousness'. So generating and discarding large numbers of solutions within simpler 'sub-brains', via iterative or parallelized operation, is very much how I would expect to see AGI and SI develop.
LGS

It's hard to find numbers. Here's what I've been able to gather (please let me know if you find better numbers than these!). I'm mostly focusing on FrontierMath.

  1. Pixel counting on the ARC-AGI image, I'm getting $3,644 ± $10 per task.
  2. FrontierMath doesn't say how many questions they have (!!!). However, they have percent breakdowns by subfield, and those percents are given to the nearest 0.1%; using this, I can narrow the range down to 289-292 problems in the dataset. Previous models solved around 3 problems (4 problems in total were ever solved by any previous model, though the full o1 was not evaluated, only o1-preview was).
  3. o3 solves 25.2% of FrontierMath. This could be 73/290. But it is also possible that some questions were removed from the dataset (e.g. because they're publicly available). 25.2% could also be 72/286 or 71/282, for example.
  4. The 280 to 290 problems means a rough ballpark for a 95% confidence interval for FrontierMath would be [20%, 30%]. It is pretty strange that the ML community STILL doesn't put confidence intervals on their benchmarks. If you see a model achieve 30% on FrontierMath later, remember that its confidence interval would overlap with o3's. (Edit: actuall
... (read more)
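As a rough sketch of the confidence-interval arithmetic in point 4 (assuming ~290 problems and a 25.2% score, i.e. roughly 73 solved, and using a simple normal approximation rather than whatever method LGS had in mind):

```python
import math

n, solved = 290, 73                 # assumed values: ~290 problems, 25.2% ≈ 73/290
p = solved / n
se = math.sqrt(p * (1 - p) / n)     # binomial standard error
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"{p:.1%}, 95% CI ≈ [{lo:.1%}, {hi:.1%}]")   # roughly [20%, 30%]
```

This reproduces the [20%, 30%] ballpark: a later model scoring ~30% would indeed have an overlapping interval.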

This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem

I don't think anchoring to o3's current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in principle, buying this capability is probably going to get orders of magnitude cheaper within months, as they find various algorithmic shortcuts.

I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e. g. the $900k they'd save by running the benchmarks half a year later.

Edit: In fact, if, against these expectations, the implementation of o3's trick can't be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per a FrontierMath problem and you can't get more efficient than one try per try), that would make me do a massive update against the "inference-time compute" paradigm.

7LGS
I think AI obviously keeps getting better. But I don't think "it can be done for $1 million" is such strong evidence for "it can be done cheaply soon" in general (though the prior on "it can be done cheaply soon" was not particularly low ex ante -- it's a plausible statement for other reasons). Like if your belief is "anything that can be done now can be done 1000x cheaper within 5 months", that's just clearly false for nearly every AI milestone in the last 10 years (we did not get a gpt4 that's 1000x cheaper 5 months later, nor alphazero, etc).
3Thane Ruthenis
I'll admit I'm not very certain in the following claims, but here's my rough model:
* The AGI labs focus on downscaling the inference-time compute costs inasmuch as this makes their models useful for producing revenue streams or PR. They don't focus on it as much beyond that; it's a waste of their researchers' time. The amount of compute at OpenAI's internal disposal is well, well in excess of even o3's demands.
* This means an AGI lab improves the computational efficiency of a given model up to the point at which they could sell it/at which it looks impressive, then drops that pursuit. And making e. g. GPT-4 10x cheaper isn't a particularly interesting pursuit, so they don't focus on that.
* Most of the models of the past several years have only been announced near the point at which they were ready to be released as products. I. e.: past the point at which they've been made compute-efficient enough to be released.
* E. g., they've spent months post-training GPT-4, and we only hear about stuff like Sonnet 3.5.1 or Gemini Deep Research once it's already out.
* o3, uncharacteristically, is announced well in advance of its release. I'm getting the sense, in fact, that we might be seeing the raw bleeding edge of the current AI state-of-the-art for the first time in a while. Perhaps because OpenAI felt the need to urgently counter the "data wall" narratives.
* Which means that, unlike the previous AIs-as-products releases, o3 has undergone ~no compute-efficiency improvements, and there's a lot of low-hanging fruit there.

Or perhaps any part of this story is false. As I said, I haven't been keeping a close enough eye on this part of things to be confident in it. But it's my current weakly-held strong view.
3LGS
So far as I know, it is not the case that OpenAI had a slower-but-equally-functional version of GPT4 many months before announcement/release. What they did have is GPT4 itself, months before; but they did not have a slower version. They didn't release a substantially distilled version. For example, the highest estimate I've seen is that they trained a 2-trillion-parameter model. And the lowest estimate I've seen is that they released a 200-billion-parameter model. If both are true, then they distilled 10x... but it's much more likely that only one is true, and that they released what they trained, distilling later. (The parameter count is proportional to the inference cost.) Previously, delays in release were believed to be about post-training improvements (e.g. RLHF) or safety testing. Sure, there were possibly mild infrastructure optimizations before release, but mostly to scale to many users; the models didn't shrink. This is for language models. For alphazero, I want to point out that it was announced 6 years ago (infinity by AI scale), and from my understanding we still don't have a 1000x faster version, despite much interest in one.
3IC Rainbow
I don't know the details, but whatever the NN thing inside current Stockfish is (derived from Lc0, a clone of AlphaZero), it can play on a laptop GPU. And even if AlphaZero derivatives didn't gain 3 OOMs by themselves, that doesn't update me much toward it being something particularly hard. Google itself has no interest in improving it further and just moved on to MuZero, to AlphaFold, etc.
5yo-cuddles
Small nudge: the questions have difficulty tiers of 25% easy, 50% medium, and 25% hard, with easy being undergrad/IMO difficulty and hard being the sort you would give to a researcher in training. The 25% accuracy gives me STRONG indications that it just got the easy ones, and the starkness of this cutoff makes me think there is something categorically different about the easy ones that makes them MUCH easier to solve - either being more close-ended, easy to verify, or just leaked into the dataset in some form.

OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.

Sources:

Altman said

we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.

François Chollet (in the blogpost with the graph) said

Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

and

The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.

An OpenAI staff member replied

Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.

and further confirmed that "tuned" in the graph is

a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.

Another OpenAI staff member said

also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the bro

... (read more)
3Vladimir_Nesov
Probably a dataset for RL; that is, the model was trained to try and try again to solve these tests with long chains of reasoning, not just tuned or pretrained on them. A detail like 75% of 400 examples sounds like a test-centric dataset design decision, with the other 25% going to the validation part of the dataset. Seems plausible they trained on ALL the tests, specifically targeting various tests; the public part of ARC-AGI is "just" a part of that dataset of all the tests. Could be some part of explaining the o1/o3 difference in the $20 tier.
1Knight Lee
Thank you so much for your research! I would have never found these statements. I'm still quite suspicious. Why would they be "including a (subset of) the public training set"? Is it accidental data contamination? They don't say so. Do they think simply including some questions and answers without reinforcement learning or reasoning would help the model solve other such questions? That's possible but not very likely. Were they "including a (subset of) the public training set" in o3's base training data? Or in o3's reinforcement learning problem/answer sets? Altman never said "we didn't go do specific work [targeting ARC-AGI]; this is just the general effort." Instead he said, The gist I get is that he admits to targeting it but that OpenAI targets all kinds of problem/answer sets for reinforcement learning, not just ARC's public training set. It felt like he didn't want to talk about this too much, from the way he interrupted himself and changed the topic without clarifying what he meant. The other sources do sort of imply no reinforcement learning. I'll wait to see if they make a clearer denial of reinforcement learning, rather than a "nondenial denial" which can be reinterpreted as "we didn't fine-tune o3 in the sense we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test." My guess is o3 is tuned using the training set, since François Chollet (developer of ARC) somehow decided to label o3 as "tuned" and OpenAI isn't racing to correct this.
0Knight Lee
See my other comment instead. The key question is "how much of the performance is due to ARC-AGI data." If the untuned o3 was anywhere as good as the tuned o3, why didn't they test it and publish it? If the most important and interesting test result is somehow omitted, take things with a grain of salt. I admit that running the test is extremely expensive, but there should be compromises like running the cheaper version or only doing a few questions. Edit: oh that reply seems to deny reinforcement learning or at least "fine tuning." I don't understand why François Chollet calls the model "tuned" then. Maybe wait for more information I guess. Edit again: I'm still not sure yet. They might be denying that it's a separate version of o3 finetuned to do ARC questions, while not denying they did reinforcement learning on the ARC public training set. I guess a week or so later we might find out what "tuned" truly means. [edited more]

“Scaling is over” was sort of the last hope I had for avoiding the “no one is employable, everyone starves” apocalypse. From that frame, the announcement video from openai is offputtingly cheerful.

4Seth Herd
Really. I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love. But how do LW futurists not expect catastrophic job loss that destroys the global economy?
lc

I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love.

I'm flabbergasted by this degree/kind of altruism. I respect you for it, but I literally cannot bring myself to care about "humanity"'s survival if it means the permanent impoverishment, enslavement or starvation of everybody I love. That future is simply not much better on my lights than everyone including the gpu-controllers meeting a similar fate. In fact I think my instincts are to hate that outcome more, because it's unjust.

But how do LW futurists not expect catastrophic job loss that destroys the global economy?

Slight correction: catastrophic job loss would destroy the ability of the non-landed, working public to participate in and extract value from the global economy. The global economy itself would be fine. I agree this is a natural conclusion; I guess people were hoping to get 10 or 15 more years out of their natural gifts.

7Seth Herd
Thank you. Oddly, I am less altruistic than many EA/LWers. They routinely blow me away. I can only maintain even that much altruism because I think there's a very good chance that the future could be very, very good for a truly vast number of humans and conscious AGIs. I don't think it's that likely that we get a perpetual boot-on-face situation. I think only about 1% of humans are so sociopathic AND sadistic in combination that they wouldn't eventually let their tiny sliver of empathy cause them to use their nearly-unlimited power to make life good for people. They wouldn't risk giving up control, just share enough to be hailed as a benevolent hero instead of merely god-emperor for eternity. I have done a little "metta" meditation to expand my circle of empathy. I think it makes me happy; I can "borrow joy". The side effect is weird decisions like letting my family suffer so that more strangers can flourish in a future I probably won't see.
6Tenoke
Survival is obviously much better because 1. You can lose jobs but eventually still have a good life (think UBI at minimum) and 2. Because if you don't like it you can always kill yourself and be in the same spot as the non-survival case anyway.
9Rafael Harth
Not to get too morbid here but I don't think this is a good argument. People tend not to commit suicide even if they have strongly net negative lives
5Richard_Kennaway
Who would the producers of stuff be selling it to in that scenario? BTW, I recently saw the suggestion that discussions of “the economy” can be clarified by replacing the phrase with “rich people’s yacht money”. There’s something in that. If 90% of the population are destitute, then 90% of the farms and factories have to shut down for lack of demand (i.e. not having the means to buy), which puts more out of work, until you get a world in which a handful of people control the robots that keep them in food and yachts and wait for the masses to die off. I wonder if there are any key players who would welcome that scenario. Average utilitarianism FTW! At least, supposing there are still any people controlling the robots by then.
2Seth Herd
That's what would happen, and the fact that nobody wanted it to happen wouldn't help. It's a Tragedy of the Commons situation.
4O O
Why would that be the likely case? Are you sure it's likely or are you just catastrophizing?
8O O
I expect the US or Chinese government to take control of these systems sooner rather than later to maintain sovereignty. I also expect there will be some force to counteract the rapid nominal deflation that would happen if there were mass job loss. Every ultra-rich person now relies on billions of people buying their products to give their companies the valuation they have. I don't think people want nominal deflation even if it's real economic growth. This will result in massive printing from the Fed that probably lands in people's pockets (like covid checks).
6Noosphere89
I think this is reasonably likely, but not a guaranteed outcome, and I do think there's a non-trivial chance that the US regulates it way too late to matter, because I expect mass job loss to be one of the last things AI does, due to pretty severe reliability issues with current AI.
5Foyle
I think Elon will bring strong concern about AI to the fore in the current executive branch. He was an early voice for AI safety; though he seems to have updated to a more optimistic view (and is pushing development through xAI), he still generally states P(doom) ~10-20%. His antipathy towards Altman and the Google founders is likely of benefit for AI regulation too - though that's no answer for the China et al. AGI development problem.
4Seth Herd
I also expect government control; see If we solve alignment, do we die anyway? for musings about the risks thereof. But it is a possible partial solution to job loss. It's a lot tougher to pass a law saying "no one can make this promising new technology even though it will vastly increase economic productivity" than to just show up to one company and say "heeeey so we couldn't help but notice you guys are building something that will utterly shift the balance of power in the world.... can we just, you know, sit in and hear what you're doing with it and maybe kibbitz a bit?" Then nationalize it officially if and when that seems necessary.
7Thane Ruthenis
I actually think doing the former is considerably more in line with the way things are done/closer to the Overton window.
6Seth Herd
For politicians, yes - but the new administration looks to be strongly pro-tech (unless DJ Trump gets a bee in his bonnet and turns dramatically anti-Musk). For the national security apparatus, the second seems more in line with how they get things done. And I expect them to twig to the dramatic implications much faster than the politicians do. In this case, there's not even anything illegal or difficult about just having some liaisons at OAI and an informal request to have them present in any important meetings. At this point I'd be surprised to see meaningful legislation slowing AI/AGI progress in the US, because the "we're racing China" narrative is so compelling - particularly to the good old military-industrial complex, but also to people at large. Slowing down might be handing the race to China, or at least a near-tie. I am becoming more sure that would beat going full-speed without a solid alignment plan. Despite my complete failure to interest anyone in the question of Who is Xi Jinping? in terms of how he or his successors would use AGI. I don't think he's sociopathic/sadistic enough to create worse x-risks or s-risks than rushing to AGI does. But I don't know.
-3O O
We still somehow got the steam engine, electricity, cars, etc. There is an element of international competition to it. If we slack here, China will probably raise armies of robots with unlimited firepower and take over the world. (They constantly show aggression.) The longshoreman strike is only allowed (I think) because the west coast did automate and is somehow less efficient than the east coast, for example.
4Thane Ruthenis
Counterpoints: nuclear power, pharmaceuticals, bioengineering, urban development. Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy, as autocracies are wont to do, all the while remaining just as clueless regarding what's happening as their US counterparts banning AI to protect the jobs. I don't really expect that to happen, but survival-without-dignity scenarios do seem salient.
4O O
I think a lot of this is wishful thinking from safetyists who want AI development to stop. This may be reductionist, but almost every pause historically can be explained by economics.

Nuclear - war usage is wholly owned by the state and developed to its saturation point (i.e. once you have nukes that can kill all your enemies, there is little reason to develop them more). Energy-wise, supposedly, it was hamstrung by regulation, but in countries like China where development went unfettered, it is still not dominant. This tells me a lot of why it's not being developed is that it's not economical. For bio-related things, Eroom's law reigns supreme. It is just economically unviable to discover drugs in the way we do. Despite this, it's clear that bioweapons are regularly researched by government labs. The USG being so eager to fund GoF research despite its bad optics should tell you as much.

"Or maybe they will accidentally ban AI too due to being a dysfunctional autocracy" - I remember many essays from people all over this site on how China wouldn't be able to get to X-1 nm (or the crucial step for it) for decades, and China would always figure out a way to get to that nm or step within a few months. They surpassed our chip lithography expectations for them. They are very competent. They are run by probably the most competent government bureaucracy in the world. I don't know what it is, but people keep underestimating China's progress. When they aim their efforts at a target, they almost always achieve it. Rapid progress is a powerful attractor state that requires a global hegemon to stop. China is very keen on the possibilities of AI, which is why they stop at nothing to get their hands on Nvidia GPUs. They also have literally no reason to develop a centralized project they are fully in control of. We have superhuman AI that seem quite easy to control already. What is stopping this centralized project on their end. No one is buying that even o3, which is nearly superhuman in ma
3winstonBosan
And for me, the (correct) reframing of RL as the cherry on top of our existing self-supervised stack was the straw that broke my hopeful back. And o3 is more straws to my broken back.
2Noosphere89
Do you mean this is evidence that scaling is really over, or is this the opposite where you think scaling is not over?

Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.

The "alignment" technique, "deliberative alignment", is much better than constitutional AI. It's the same during training, but it also teaches the model the safety criteria, and teaches the model to reason about them at runtime, using a reward model that compares the output to their specific safety criteria. (This also suggests something else I've been expecting - the CoT training technique behind o1 doesn't need perfectly verifiable answers in coding and math, it can use a decent guess as to the better answer in what's probably the same procedure).

While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I've been working on an update to my Internal independent review for language model agent alignment, and have been thinking about how this type of review could be trained instead of scripted into an agent as I'd originally predicted.

This is that technique. It does have some promise.

But I don't think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffold... (read more)

Oh dear, RL for everything, because surely nobody's been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.

1Noosphere89
My rather hot take is that a lot of the arguments for safety of LLMs also transfer over to practical RL efforts, with some caveats.
2Charlie Steiner
I agree; after all, RLHF was originally for RL agents. As long as the models aren't all that smart, and the tasks they have to do aren't all that long-term, the transfer should work great, and the occasional failure won't be a problem because, again, the models aren't all that smart. To be clear, I don't expect a 'sharp left turn' so much as 'we always implicitly incentivized exploitation of human foibles, we just always caught it when it mattered, until we didn't.'

I was still hoping for a sort of normal life. At least for a decade or maybe more. But that just doesn't seem possible anymore. This is a rough night.

My probably contrarian take is that I don't think improvement on a benchmark of math problems is particularly scary or relevant. It's not nothing -- I'd prefer if it didn't improve at all -- but it only makes me slightly more worried.

5Matt Goldenberg
can you say more about your reasoning for this?

About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I'd freak out if/when LLMs solve them. They're still 1/10 and nothing has changed in the past year, and I doubt o3 will do better. (But I'm not making them public.)

Will write a reply to this comment when I can test it.

4Matt Goldenberg
can you say the types of problems they are?
4Rafael Harth
You could call them logic puzzles. I do think most smart people on LW would get 10/10 without too many problems, if they had enough time, although I've never tested this.
2Noosphere89
Assuming they are verifiable or have an easy way to verify whether or not a solution does work, I expect o3 to at least get 2/10, if not 3/10 correct under high-compute settings.
1IC Rainbow
What's the last model you did check with, o1-pro?
-1O O
Do you have a link to these?
1gonz
What benchmarks (or other capabilities) do you see as more relevant, and how worried were you before?

Regarding whether this is a new base model, we have the following evidence: 

Jason Wei:

o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years

Nat McAleese:

o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)

The prices leaked by ARC-AGI people indicate $60/million output tokens, which is also the current o1 pricing. 33m total tokens and a cost of $2,012.

Notably, the Codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it's secretly a log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That's corroborated by the ARC-AGI numbers: an average of 55k tokens per solution, which seems like a ton.
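A quick sanity check of that pricing arithmetic (taking the leaked figures above at face value: $60 per 1M output tokens, 33M total tokens, ~55k tokens per solution):

```python
price_per_mtok = 60         # leaked $/1M output tokens
total_tokens = 33e6         # reported total tokens
tokens_per_solution = 55e3  # reported average per ARC-AGI solution

print(f"implied cost: ${total_tokens / 1e6 * price_per_mtok:,.0f}")           # ~$1,980, close to the reported $2,012
print(f"implied solution traces: {total_tokens / tokens_per_solution:,.0f}")  # ~600 traces at 55k tokens each
```

So the per-token price, token counts, and total cost are at least mutually consistent.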

I think this evidence indicates this is likely the same b... (read more)

7Vladimir_Nesov
GPT-4o costs $10 per 1M output tokens, so the cost of $60 per 1M tokens is itself more than 6 times higher than it has to be. Which means they can afford to sell a much more expensive model at the same price. It could also be GPT-4.5o-mini or something, similar in size to GPT-4o but stronger, with knowledge distillation from full GPT-4.5o, given that a new training system has probably been available for 6+ months now.

Using $4K per task means a lot of inference in parallel, which wasn't in o1. So that's one possible source of improvement, maybe it's running MCTS instead of individual long traces (including on low settings at $20 per task). And it might be built on the 100K H100s base model.

The scary less plausible option is that RL training scales, so it's mostly o1 trained with more compute, and $4K per task is more of an inefficient premium option on top rather than a higher setting on o3's source of power.

3Zach Stein-Perlman
The obvious boring guess is best of n. Maybe you're asserting that using $4,000 implies that they're doing more than that.

Performance at $20 per task is already much better than for o1, so it can't be just best-of-n, you'd need more attempts to get that much better even if there is a very good verifier that notices a correct solution (at $4K per task that's plausible, but not at $20 per task). There are various clever beam search options that don't need to make inference much more expensive, but in principle might be able to give a boost at low expense (compared to not using them at all).

There's still no word on the 100K H100s model, so that's another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3 based on a yet-unannounced GPT-4.5o base model that's better than Claude 3.5 Sonnet might be sufficient to explain the improvement. Without any public 100K H100s Chinchilla optimal models it's hard to say how much that alone should help.

4RussellThor
Anyone want to guess how capable Claude system level 2 will be when it is polished? I expect better than o3 by a small amt.
2Aaron_Scher
The ARC-AGI page (which I think has been updated) currently says: 

I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it's around 8% (rather than the light blue at 25.2%). Which like, I dunno, maybe this is nit picking, but 25% on FrontierMath seems like a BIG deal, and I'd like to know how much to be updating my beliefs.

From an apparent author on reddit:

[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems

The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions. 

9Eric Neyman
My random guess is:
* The dark blue bar corresponds to the testing conditions under which the previous SOTA was 2%.
* The light blue bar doesn't cheat (e.g. doesn't let the model run many times and then see if it gets it right on any one of those times) but spends more compute than one would realistically spend (e.g. more than how much you could pay a mathematician to solve the problem), perhaps by running the model 100 to 1000 times and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning.
4Zach Stein-Perlman
The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
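A minimal sketch of that majority-vote ("consensus@n") selection; `sample_answer` is a hypothetical stand-in for one model call:

```python
from collections import Counter

def consensus_answer(problem, sample_answer, n=128):
    """Sample n independent answers and submit the most common one (consensus@n)."""
    answers = [sample_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Exact-match counting like this only works because the answers are numerical-ish; free-form proofs would need a fuzzier notion of "the same answer".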
4Eric Neyman
Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.
7Alex_Altair
On the livestream, Mark Chen says the 25.2% was achieved "in aggressive test-time settings". Does that just mean more compute?
2Charlie Steiner
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
1Jonas Hallgren
Extremely long chain of thought, no?
4Alex_Altair
I guess one thing I want to know is like... how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn't what happened, but it could be something similar. If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn't particularly matter to me how much compute it used.
4Zach Stein-Perlman
It was one submission, apparently.
3Alex_Altair
Thanks. Is "pass@1" some kind of lingo? (It seems like an ungoogleable term.)
7Vladimir_Nesov
Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (but still interesting to observe), the non-cheating option is best-of-k where the system needs to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
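A toy sketch of that distinction, with hypothetical `generate`, `oracle_verify`, and `self_score` functions (pass@k consults an external oracle verifier, while best-of-k must commit to one attempt using only the system's own judgment):

```python
def pass_at_k(problem, generate, oracle_verify, k):
    # "Cheating" metric: success if ANY of k attempts is correct per the oracle.
    return any(oracle_verify(problem, generate(problem)) for _ in range(k))

def best_of_k(problem, generate, self_score, k):
    # Non-cheating: generate k attempts, pick one using the system's own scorer,
    # and only that single submission gets graded.
    attempts = [generate(problem) for _ in range(k)]
    return max(attempts, key=lambda a: self_score(problem, a))
```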
4Zach Stein-Perlman
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
8Vladimir_Nesov
For coding, a problem statement won't have exhaustive formal requirements that will be handed to the solver, only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. Affordance to reliably verify helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn't itself have a verifier on hand).

As I say here https://x.com/boazbaraktcs/status/1870369979369128314

Constitutional AI is a great work, but Deliberative Alignment is fundamentally different. The difference is basically System 1 vs System 2. In RLAIF, ultimately the generative model that answers the user prompt is trained with (prompt, good response, bad response). Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and most importantly is not taught how to reason about this text in the context of a particular example.

T... (read more)
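Purely as an illustration of that contrast (made-up field names, not the actual training format of either method), the data the policy model sees differs roughly like this:

```python
# RLAIF / Constitutional AI style: the constitution shapes the preference labels offline,
# but the policy is tuned on (prompt, better, worse) pairs and never sees the spec text.
rlaif_example = {
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with that, but here's how pin-tumbler locks work in general ...",
    "rejected": "Sure, first insert a tension wrench ...",
}

# Deliberative-alignment style (sketch): the safety spec text is part of what the model
# is taught, and it is rewarded for chain-of-thought that cites and applies that text.
deliberative_example = {
    "prompt": "How do I pick a lock?",
    "spec_excerpt": "Refuse requests that facilitate unauthorized entry ...",
    "reasoning": "This request falls under the unauthorized-entry clause, so refuse ...",
    "response": "Sorry, I can't help with that.",
}
```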

4boazbarak
Also, the thing I am most excited about with deliberative alignment is that it becomes better as models become more capable. o1 is already more robust than o1-preview and I fully expect this to continue. (P.S. apologies in advance if I'm unable to keep up with comments; popped in from holiday to post on the DA paper.)
O O

While I'm not surprised by the pessimism here, I am surprised at how much of it is focused on personal job loss. I thought there would be more existential dread. 

9Vladimir_Nesov
Existential dread doesn't necessarily follow from this specific development if training only works around verifiable tasks and not for everything else, like with chess. Could soon be game-changing for coding and other forms of engineering, without full automation even there and without applying to a lot of other things.
8O O
Oh, I guess I was assuming automation of coding would result in a step change in research in every other domain. I know that coding is actually one of the biggest blockers in much of AI research and automation in general. It might soon become cost-effective to write bespoke solutions for a lot of labor jobs, for example.

I just straight up don't believe the Codeforces rating. I guess only a small subset of people solve algorithmic problems for fun in their free time, so it's probably opaque to many here, but a rating of 2727 (the one in the table) would be what's called an international grandmaster and is the 176th best rating among all actively competing users on the site. I hope they will soon release details about how they got that performance measure.

9Olli Järviniemi
CodeForces ratings are determined by your performance in competitions, and your score in a competition is determined, in part, by how quickly you solve the problems. I'd expect o3 to be much faster than human contestants. (The specifics are unclear - I'm not sure how a large test-time compute usage translates to wall-clock time - but at the very least o3 parallelizes between problems.) This inflates the results relative to humans somewhat. So one shouldn't think that o3 is in the top 200 in terms of algorithmic problem solving skills.
8ryan_greenblatt
As in, for the literal task of "solve this code forces problem in 30 minutes" (or whatever the competition allows), o3 is ~ top 200 among people who do codeforces (supposing o3 didn't cheat on wall clock time). However, if you gave humans 8 serial hours and o3 8 serial hours, much more than 200 humans would be better. (Or maybe the cross over is at 64 serial hours instead of 8.) Is this what you mean?
7Olli Järviniemi
This is close but not quite what I mean. Another attempt: The literal Do Well At CodeForces task takes the form "you are given ~2 hours and ~6 problems, maximize this score function that takes into account the problems you solved and the times at which you solved them". In this o3 is in top 200 (conditional on no cheating). So I agree there. As you suggest, a more natural task would be "you are given t time and one problem, maximize your probability of solving it in the given time". Already at t equal to ~1 hour (which is what contestants typically spend on the hardest problem they'll solve),  I'd expect o3 to be noticeably worse than top 200. This is because the CodeForces scoring function heavily penalizes slowness, and so if o3 and a human have equal performance in the contests, the human has to make up for their slowness by solving more problems. (Again, this is assuming that o3 is faster than humans in wall clock time.) I separately believe that humans would scale better than AIs w.r.t. t, but that is not the point I'm making here.
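To make the slowness penalty concrete, here is a rough CodeForces-style score function (illustrative shape only, not the site's exact formula):

```python
def problem_score(max_points, minutes_to_solve, wrong_submissions):
    # Score decays with solve time and takes a flat penalty per wrong submission,
    # but never drops below 30% of the problem's value.
    decayed = max_points - (max_points / 250) * minutes_to_solve - 50 * wrong_submissions
    return max(0.3 * max_points, decayed)

print(problem_score(1000, 10, 0))   # 960.0 -- fast solver keeps most of the points
print(problem_score(1000, 60, 0))   # 760.0 -- same problem, solved slowly
```

Under a function shaped like this, a system that solves the same problems much faster than humans gets a large rating boost even if it solves nothing extra.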
1Joel Burget
It's hard to compare across domains but isn't the FrontierMath result similarly impressive?

How have your AGI timelines changed after this announcement?

6Thane Ruthenis
~No update, priced it all in after the Q* rumors first surfaced in November 2023.
5Alexander Gietelink Oldenziel
A rumor is not the same as a demonstration.
5Thane Ruthenis
It is if you believe the rumor and can extrapolate its implications, which I did. Why would I need to wait to see the concrete demonstration that I'm sure would come, if I can instead update on the spot? It wasn't hard to figure out how "something like an LLM with A*/MCTS stapled on top" would look like, or where it'd shine, or that OpenAI might be trying it and succeeding at it (given that everyone in the ML community had already been exploring this direction at the time).
8Alexander Gietelink Oldenziel
Suppose I throw up a coin but I dont show you the answer. Your friend's cousin tells you they think the bias is 80/20 in favor of heads. If I show you the outcome was indeed heads should you still update ? (Yes)
6Thane Ruthenis
Sure. But if you know the bias is 95/5 in favor of heads, and you see heads, you don't update very strongly. And yes, I was approximately that confident that something-like-MCTS was going to work, that it'd demolish well-posed math problems, and that this is the direction OpenAI would go in (after weighing in the rumor's existence). The only question was the timing, and this is mostly within my expectations as well.
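A toy numerical version of that point, with made-up numbers: suppose you put 80% on "the MCTS-like approach works", that hypothesis predicts the observed result with probability 0.95, and the alternative predicts it with probability 0.5. Then seeing the result barely moves you:

```python
prior = 0.80                                 # made-up prior on "the approach works"
p_obs_if_works, p_obs_if_not = 0.95, 0.50    # made-up predictive probabilities
posterior_odds = (prior / (1 - prior)) * (p_obs_if_works / p_obs_if_not)
posterior = posterior_odds / (1 + posterior_odds)
print(f"prior {prior:.0%} -> posterior {posterior:.0%}")   # 80% -> ~88%
```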
5Alexander Gietelink Oldenziel
That's significantly outside the prediction intervals of forecasters, so I will need to see a Metaculus/Manifold/etc. account where you explicitly made this prediction, sir.
7Thane Ruthenis
Fair! Except I'm not arguing that you should take my other predictions at face value on the basis of my supposedly having been right that one time. Indeed, I wouldn't do that without just the sort of receipt you're asking for! (Which I don't have. Best I can do is a December 1, 2023 private message I sent to Zvi making correct predictions regarding what o1-3 could be expected to be, but I don't view these predictions as impressive and it notably lacks credences.) I'm only countering your claim that no internally consistent version of me could have validly updated all the way here from November 2023. You're free to assume that the actual version of me is dissembling or confabulating.
4mattmacdermott
The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected. Like if you had thrown 100 coins and then revealed that 80 were heads.
5Mateusz Bagiński
I guess one's timelines might have gotten longer if one had very high credence that the paradigm opened by o1 is a blind alley (relative to the goal of developing human-worker-omni-replacement-capable AI) but profitable enough that OA gets distracted from its official most ambitious goal. I'm not that person.
4Vladimir_Nesov
$100-200bn 5 GW training systems are now a go. So in the worlds that would have slowed down for years with only $30bn systems available and a need for an additional scaling push, timelines moved up a few years. Not sure how unlikely $100-200bn systems would've been without o1/o3, but they seem likely now.
1anaguma
What do you think is the current cost of o3, for comparison? 
6Vladimir_Nesov
In the same terms as the $100-200bn I'm talking about, o3 is probably about $1.5-5bn, meaning 30K-100K H100, the system needed to train GPT-4o or GPT-4.5o (or whatever they'll call it) that it might be based on. But that's the cost of a training system, its time needed for training is cheaper (since the rest of its time can be used for other things). In the other direction, it's more expensive than just that time because of research experiments. If OpenAI spent $3bn in 2024 on training, this is probably mostly research experiments.

First post, feel free to meta-critique my rigor on this post as I am not sure what is mandatory, expected, or superfluous for a comment under a post. Studying computer science but have no degree, yet. Can pull the specific citation if necessary, but...

these benchmarks don't feel genuine.

Chollet indicated in his piece:

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score ov

... (read more)
3Zach Stein-Perlman
Welcome! To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.
1yo-cuddles
Thank you for the warm reply; it's nice, and also good feedback that I didn't do anything explicitly wrong with my post. It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle questions multiple times to verify the best answers, or something banal like that. Wish they didn't make us wait so long to test that :/
3Lukas_Gloor
Well, the update for me would go both ways.  On one side, as you point out, it would mean that the model's single pass reasoning did not improve much (or at all).  On the other side, it would also mean that you can get large performance and reliability gains (on specific benchmarks) by just adding simple stuff. This is significant because you can do this much more quickly than the time it takes to train a new base model, and there's probably more to be gained in that direction – similar tricks we can add by hardcoding various "system-2 loops" into the AI's chain of thought and thinking process.  You might reply that this only works if the benchmark in question has easily verifiable answers. But I don't think it is limited to those situations. If the model itself (or some subroutine in it) has some truth-tracking intuition about which of its answer attempts are better/worse, then running it through multiple passes and trying to pick the best ones should get you better performance even without easy and complete verifiability (since you can also train on the model's guesses about its own answer attempts). Besides, I feel like humans do something similar when we reason: we think up various ideas and answer attempts and run them by an inner critic, asking "is this answer I just gave actually correct/plausible?" or "is this the best I can do, or am I missing something?." (I'm not super confident in all the above, though.) Lastly, I think the cost bit will go down by orders of magnitude eventually (I'm confident of that). I would have to look up trends to say how quickly I expect $4,000 in runtime costs to go down to $40, but I don't think it's all that long. Also, if you can do extremely impactful things with some model, like automating further AI progress on training runs that cost billions, then willingness to pay for model outputs could be high anyway. 
1yo-cuddles
I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently. I believe you're right that we do something similar to the LLMs (loosely, analogously); see https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble (I need to learn markdown). My intuition is still LLM-pessimistic. I'd be excited to see good practical uses; this seems like tool AI, and that makes my existential dread easier to manage!

That's some significant progress, but I don't think it will lead to TAI.

However, there is a realistic best-case scenario where LLMs/Transformers stop just short of that and can still give useful lessons and capabilities.

I would really like to see such an LLM system get as good as a top human team at security, so it could then be used to inspect and hopefully fix masses of security vulnerabilities. Note that this could give a false sense of security - an unknown-unknown type of situation where it wouldn't find a totally new type of attack, say a combined SW/HW attack like Rowhammer/Meltdown but more creative. A superintelligence not based on LLMs could, however.

OpenAI didn't say what the light blue bar is

Presumably light blue is o3 high, and dark blue is o3 low?

2Zach Stein-Perlman
I think they only have formal high and low versions for o3-mini. Edit: nevermind, idk

For people who don't expect a strong government response... remember that Elon is First Buddy now. 🎢

Beating benchmarks, even very difficult ones, is all fine and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-taker with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. T... (read more)