OpenAI presented o3 on the Friday before Christmas, at the tail end of the 12 Days of Shipmas.

I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?

They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):

Greg Brockman (President of OpenAI): o3, our latest reasoning model, is a breakthrough, with a step function improvement on our most challenging benchmarks. We are starting safety testing and red teaming now.

 

Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.

o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.

First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.

This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.

Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.

o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]

The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!

Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.

Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.

Table of Contents

  1. GPQA Has Fallen
  2. Codeforces Has Fallen
  3. Arc Has Kind of Fallen But For Now Only Kinda
  4. They Trained on the Train Set
  5. AIME Has Fallen
  6. Frontier of Frontier Math Shifting Rapidly
  7. FrontierMath 4: We’re Going To Need a Bigger Benchmark
  8. What is o3 Under the Hood?
  9. Not So Fast!
  10. Deep Thought
  11. Our Price Cheap
  12. Has Software Engineering Fallen?
  13. Don’t Quit Your Day Job
  14. Master of Your Domain
  15. Safety Third
  16. The Safety Testing Program
  17. Safety testing in the reasoning era
  18. How to apply
  19. What Could Possibly Go Wrong?
  20. What Could Possibly Go Right?
  21. Send in the Skeptic
  22. This is Almost Certainly Not AGI
  23. Does This Mean the Future is Open Models?
  24. Not Priced In
  25. Our Media is Failing Us
  26. Not Covered Here: Deliberative Alignment
  27. The Lighter Side

GPQA Has Fallen

Codeforces Has Fallen


 

Deedy: OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large.

The median Gold medalist at the IOI, the top international programming contest for high schoolers, has a rating of 2469.

That’s how incredible this result is.

In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.

Arc Has Kind of Fallen But For Now Only Kinda

o3 shows dramatically improved performance on the ARC-AGI challenge.

Francois Chollet offers his thoughts, full version here.

Arc Prize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We’re in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

hero: o3’s secret? the “I will give you $1k if you complete this task correctly” prompt but you actually send it the money.

Rohit: It’s actually Sam in the back end with his venmo.

Is there a catch?

There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent, o3 high is spending thousands of dollars.

It is worth noting that $0.10 per problem is a lot cheaper than human level.

Ajeya Cotra: I think a generalist AI system (not fine-tuned on ARC AGI style problems) may have to be pretty *superhuman* to solve them at $0.10 per problem; humans have to run a giant (1e15 FLOP/s) brain, probably for minutes on the more complex problems.
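To put rough numbers on Ajeya’s comparison, here is a minimal back-of-the-envelope sketch. The brain figure is hers; the minutes-per-problem and FLOP-per-dollar numbers are my own assumptions, there purely for illustration.

```python
# Back-of-the-envelope sketch of Ajeya Cotra's point. The 1e15 FLOP/s brain
# figure comes from her comment; the other two constants are assumptions.

BRAIN_FLOP_PER_SEC = 1e15       # figure from Cotra's comment
MINUTES_PER_PROBLEM = 5         # assumption: a few minutes on a harder problem
human_flop = BRAIN_FLOP_PER_SEC * MINUTES_PER_PROBLEM * 60

FLOP_PER_DOLLAR = 1e17          # assumption: order-of-magnitude GPU pricing only

print(f"Human 'compute' per problem: {human_flop:.1e} FLOP")
print(f"Cost of that much compute at the assumed rate: ${human_flop / FLOP_PER_DOLLAR:.2f}")
```

With these made-up rates the human-equivalent compute comes out to a few dollars per problem, so hitting the same problems for $0.10 each really would be operating well below the human compute budget.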

Beyond that, is there another catch? That’s a matter of some debate.

Even with catches, the improvements are rather mind-blowing.

President of the Arc prize Greg Kamradt verified the result.

Greg Kamradt: We verified the o3 results for OpenAI on @arcprize.

My first thought when I saw the prompt they used to claim their score was…

“That’s it?”

It was refreshing (impressive) to see the prompt be so simple:

“Find the common rule that maps an input grid to an output grid.”

Brandon McKinzie (OpenAI): to anyone wondering if the high ARC-AGI score is due to how we prompt the model: nah. I wrote down a prompt format that I thought looked clean and then we used it…that’s the full story.

Pliny the Liberator: can I try?

For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.

And this progress is quite a lot.

It is not, however, a direct harbinger of AGI, one does not want to overreact.

Noam Brown (OpenAI): I think people are overindexing on the @OpenAI o3 ARC-AGI results. There’s a long history in AI of people holding up a benchmark as requiring superintelligence, the benchmark being beaten, and people being underwhelmed with the model that beat it.

To be clear, @fchollet and @mikeknoop were always very clear that beating ARC-AGI wouldn’t imply AGI or superintelligence, but it seems some people assumed that anyway.

Here is Melanie Mitchell giving an overview that seems quite good.

Except, oh no!

They Trained on the Train Set

How dare they!

Arc Prize: Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more detail. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

Niels Rogge: By training on 75% of the training set.

Gary Marcus: Wow. This, if true, raises serious questions about yesterday’s announcement.

Roon: oh shit oh f*** they trained on the train set it’s all over now

Also important to note that 75% of the train set is like 2-300 examples.

🚨SCANDAL 🚨

OpenAI trained on the train set for the Millennium Puzzles.

Johan: Given that it scores 30% on ARC AGI 2, it’s clear there was no improvement in fluid reasoning and the only gain was due to the previous model not being trained on ARC.

Roon: well the other benchmarks show improvements in reasoning across the board

but regardless, this mostly reveals that its real performance on ARC AGI 2 is much higher

Rythm Garg: also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

Emmett Shear: Was anyone on the team aware of and thinking about arc and arc-like problems as a domain to improve at when you were designing and training o3? (The distinction between succeeding as a random side effect and succeeding with intention)

Rythm Garg: no, the team wasn’t thinking about arc when training o3; people internally just see it as one of many other thoughtfully-designed evals that are useful for monitoring real progress

Or:

Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’

Previous SotA on ARC involved training not only on the train set, but on a much larger synthetic training set. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.

If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.

The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.

Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.

The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.

The other objection is that this required quite a lot of compute.

That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.

It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.

Mike Bober-Irizar: Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It’s not what you might think.

OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.

LLMs are dramatically worse at ARC tasks the bigger the tasks get. However, humans have no such issues – ARC task difficulty is independent of size.

Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

Task solve rates (ARC-AGI-Pub) by various models, along with human performance. With all LLMs, the solve rate drops dramatically as the problem size increases (o3 works for much bigger tasks). Human performance stays roughly constant.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can’t handle this many tokens.

When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks – even if the solutions are the same.

When models can’t understand the task format, the benchmark can mislead, introducing a hidden threshold effect.

And if there’s always a larger version that humans can solve but an LLM can’t, what does this say about scaling to AGI?

The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.

Roon: arc is hard due to perception rather than reasoning -> seems clear and shut
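To make the perception claim concrete, here is a minimal sketch, using my own toy serialization rather than whatever format OpenAI or ARC Prize actually used, of how fast an ARC grid grows once you write it out as text.

```python
# Toy illustration of why big ARC grids strain a model's context handling:
# serialize a grid of color indices as text and watch the size scale.
# This is my own hypothetical serialization, not the real harness format.

def serialize_grid(grid: list[list[int]]) -> str:
    """Render a grid of color indices (0-9) as newline-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

instruction = "Find the common rule that maps an input grid to an output grid."

for size in (8, 16, 32):
    grid = [[(r * c) % 10 for c in range(size)] for r in range(size)]
    text = serialize_grid(grid)
    # Crude proxy for token count: one whitespace-separated piece per cell.
    approx_tokens = len(text.split())
    print(f"{size}x{size} grid -> roughly {approx_tokens} cell tokens, plus the instruction")
```

A real task also includes several input/output example pairs plus the test input, so the full prompt is several times larger again. The claim above is that this sheer size, not the underlying rule, is what trips up smaller models.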

AIME Has Fallen

I remember when AIME problems were hard.

This one is not a surprise. It did definitely happen.

AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.

Frontier of Frontier Math Shifting Rapidly

Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:

Terence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least.

Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them.

Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.

At the time, no model solved more than 2% of these questions. And then there’s o3.


Noam Brown: This is the result I’m most excited about. Even if LLMs are dumb in some ways, saturating evals like @EpochAIResearch’s Frontier Math would suggest AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research.

This also means that AI safety topics like scalable oversight may soon stop being hypothetical. Research in these domains needs to be a priority for the field.

Tamay Besiroglu: I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3’s 25.2% at Pass@1 is substantially more impressive.

It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).

I previously predicted a 25% performance by Dec 31, 2025 (my median forecast with an 80% CI of 14–60%). o3 has reached it earlier than I’d have expected on average.
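For anyone unfamiliar with the Pass@1 versus Pass@8 distinction Tamay is drawing: Pass@k asks whether at least one of k sampled attempts solves the problem. Here is a minimal sketch of the standard unbiased estimator from the HumanEval paper, with made-up numbers.

```python
# Minimal sketch of the pass@k metric (Pass@1 vs Pass@8), using the standard
# unbiased estimator from the HumanEval paper:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n = attempts sampled per problem and c = how many were correct.
# The example numbers below are made up for illustration.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 16 attempts on a problem, 2 of them correct.
print(round(pass_at_k(n=16, c=2, k=1), 3))  # pass@1 = 0.125
print(round(pass_at_k(n=16, c=2, k=8), 3))  # pass@8 is much higher (~0.767)
```

The point is that Pass@1 measures solving on the first try, which is why Tamay treats o3’s 25.2% Pass@1 as a bigger jump than a raw comparison against the earlier ~6% Pass@8 figure would suggest.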

It is indeed rather crazy how many people only weeks ago thought this level of Frontier Math was a year or more away.

Therefore…

FrontierMath 4: We’re Going To Need a Bigger Benchmark

When FrontierMath is about to no longer be beyond the frontier, find a new frontier. Fast.

Tamay Besiroglu (6:52pm, December 21, 2024): I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

Elliot Glazer (6:30pm, December 21, 2024): For context, FrontierMath currently spans three broad tiers:

• T1 (25%) Advanced, near top-tier undergrad/IMO

• T2 (50%) Needs serious grad-level background

• T3 (25%) Research problems demanding relevant research experience

All can take hours—or days—for experts to solve.

Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.

Process for a Tier 4 problem:

  1. 1 week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.
  2. 3 weeks of collaborative research. Presentations among related teams for feedback.
  3. Two weeks for the final submission.

We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email elliot@epoch.ai with your CV and a brief note on your interests.

We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check for mathematical correctness of final submissions. Contact me if you think you’re suitable for either such role.

As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.

Tier 5 could presumably be ‘ask a bunch of problems we actually have no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.

What is o3 Under the Hood?

From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.

Amjad Masad: Based on benchmarks, OpenAI’s o3 seems like a genuine breakthrough in AI.

Maybe a start of a new paradigm.

But what’s new is also old: under the hood it might be Alpha-zero-style search and evaluate.

The author of ARC-AGI benchmark @fchollet speculates on how it works.

Davidad (other thread): o1 doesn’t do tree search, or even beam search, at inference time. it’s distilled. what about o3? we don’t know—those inference costs are very high—but there’s no inherent reason why it must be un-distill-able, since Transformers are Turing-complete (with the CoT itself as tape)

Teortaxes: I am pretty sure that o3 has no substantial difference from o1 aside from training data.
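To make the speculation concrete, and it is only speculation, here is a minimal sketch of what an AlphaZero-style ‘search and evaluate’ loop over chains of thought could look like. Every component is a placeholder; nothing here is a claim about how o3 actually works, and as Davidad notes it may well be distilled rather than searching at inference time.

```python
# Purely illustrative sketch of the "search and evaluate over chains of thought"
# idea Chollet and Masad speculate about. Not a description of o3: all of the
# components below are placeholders standing in for unknown internals.

import random
from dataclasses import dataclass

@dataclass
class Candidate:
    chain_of_thought: str
    answer: str
    score: float

def generate_candidate(problem: str) -> Candidate:
    """Placeholder for sampling one reasoning chain from a base model."""
    cot = f"(sampled reasoning about: {problem})"
    answer = f"answer-{random.randint(0, 9)}"
    return Candidate(cot, answer, score=0.0)

def evaluate(candidate: Candidate) -> float:
    """Placeholder for a learned verifier / value model scoring a chain."""
    return random.random()

def search_and_evaluate(problem: str, num_samples: int = 16) -> Candidate:
    """Best-of-N: sample many chains, score each, return the top-scoring one."""
    candidates = [generate_candidate(problem) for _ in range(num_samples)]
    for c in candidates:
        c.score = evaluate(c)
    return max(candidates, key=lambda c: c.score)

best = search_and_evaluate("Find the common rule that maps an input grid to an output grid.")
print(best.answer, best.score)
```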

Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.

Jessica Taylor: o3 implies Christiano’s factored cognition work is more relevant empirically; yes, you can get a lot from factored cognition.

Potential further capabilities come through iterative amplification and distillation, like ALBA.

If you care about alignment, go read Christiano!

I agree with that somewhat. I’m confused how far to go with it.

If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.

(Again, I don’t know if that’s what happened or not.)

Not So Fast!

And I don’t simply mean that the full o3 is not so fast, which it indeed is not:

Noam Brown: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.

Poaster Child: Waiting for singularity bros to discover economics.

Noam Brown: I worked at the federal reserve for 2 years.

I am waiting for economists to discover various things, Noam Brown excluded.

Jason Wei (OpenAI): o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years.

Scary fast? Absolutely.

However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.

Davidad: If using “speed from o1 announcement to o3 announcement” to calibrate your velocity expectations, do take note that the o1 announcement was delayed by safety testing (and many OpenAI releases have been delayed in similar ways), whereas o3 was announced prior to safety testing.

They are only now starting o3 safety testing, from the sound of it this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.

That is still, again, scary fast.

It is however also the low hanging fruit, on two counts.

  1. We went from o1 → o3 in large part by having it spend over $1,000 on tasks. You can’t pull that trick that many more times in a row. The price will come down over time, and o3 is clearly more efficient than o1, so yes we will still make progress here, but there aren’t that many tasks where you can efficiently spend $10k+ on a slow query, especially if it isn’t reliable.
  2. This is a new paradigm of how to set up an AI model, so it should be a lot easier to find various algorithmic improvements.

Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?

Even if not, it will still get faster and cheaper. And that alone is huge.

Deep Thought

As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.

Ethan Mollick: Basically, think of the O3 results as validating Douglas Adams as the science fiction author most right about AI.

When given more time to think, the AI can generate answers to very hard questions, but the cost is very high, and you have to make sure you ask the right question first.

And the answer is likely to be correct (but we cannot be sure because verifying it requires tremendous expertise).

He also was right about machines that work best when emotionally manipulated and machines that guilt you.

Sully: With O3 costing (potentially) $2,000 per task on “high compute,” the app layer is needed more than ever.

For example, giving the wrong context to it and you just burned $1,000.

Likely, we have a mix of models based on their pricing/intelligence at the app layer, prepping the data to feed it into O3.

100% worth the money but the last thing u wana do is send the wrong info lol

Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.

Our Price Cheap

Right now o3 is rather expensive, although o3-mini will be cheaper than o1.

That doesn’t mean o3-level outputs will stay expensive, although presumably once they are people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.

Seb Krier: Lots of poor takes about the compute costs to run o3 on certain tasks and how this is very bad, lead to inequality etc.

This ignores how quickly these costs will go down over time, as they have with all other models; and ignores how AI being able to do things you currently have to pay humans orders of magnitude more to do will actually expand opportunity far more compared to the status quo.

Remember when early Ericsson phones were a quasi-luxury good?

Simeon: I think this misses the point that you can’t really buy a better iPhone even with $1M whereas you can buy more intelligence with more capital (which is why you get more inequalities than with GPT-n). You’re right that o3 will expand the pie but it can expand both the size of the pie and inequalities.

Seb Krier: An individual will not have the same demand for intelligence as e.g. a corporation. Your last sentence is what I address in my second point. I’m also personally less interested in inequality/the gap than poverty/opportunity etc.

Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.

You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.

This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.

To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.

Still, yes, some people don’t love this.

Haydn Belfield: o1 & o3 bring to an end the period when everyone—from Musk to me—could access the same quality of AI model.

From now on, richer companies and individuals will be able to pay more for inference compute to get better results.

Further concentration of wealth and power is coming.

Inference cost *will* decline quickly and significantly. But this will not change the fact that this paradigm enables converting money into outcomes.

  1. Lower costs for everyone mean richer companies can buy even more.
  2. Companies will now feel confident to invest 10–100x more into inference compute.

This is a new way to convert money into better outcomes, so it will advantage those with more capital.

Even for a fast-growing, competent startup, it is hard to recruit and onboard many people quickly at scale.

o3 is like being able to scale up world-class talent.

  1. Rich companies are talent-constrained. It takes time and effort to scale a workforce, and it is very difficult to buy more time or work from the best performers. This is a way to easily scale up talent and outcomes simply by using more money!

Some people in replies are saying “twas ever thus”—not for most consumer technology!

Musk cannot buy a 100 times better iPhone, Spotify, Netflix, Google search, MacBook, or Excel, etc.

He can buy 100 times better legal, medical, or financial services.

AI has now shifted from the first group to the second.

Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.

And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.

And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.

The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.

If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.

I find Altman’s framing interesting here, and important:

Sam Altman: seemingly somewhat lost in the noise of today.

On many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

I expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be truly strange.

Exponentially more money for marginally more performance.

Over time, massive cost reductions.

In a sense, the extra money is buying you living in the future.

Do you want to live in the future, before you get the cost reductions?

In some cases, very obviously yes, you do.

Has Software Engineering Fallen?

I would not say it has fallen. I do know it will transform.

If two years from now you are writing code line by line, you’ll be a dinosaur.

Sully: yeah its over for coding with o3

this is mindboggling

looks like the first big jump since gpt4, because these numbers make 0 sense

By the way, I don’t say this lightly, but

Software engineering in the traditional sense is dead in less than two years.

You will still need smart, capable engineers.

But anything that involves raw coding and no taste is done for.

o6 will build you virtually anything.

Still Bullish on things that require taste (design and such)

The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.

Don’t Quit Your Day Job

If you’re in AI capabilities, pivot to AI safety.

If you’re in software engineering, pivot to software architecting.

If you’re in working purely for a living, pivot to building things and shipping them.

But otherwise, don’t quit your day job.

Null Pointered (6.4m views): If you are a software engineer who’s three years into your career: quit now. there is not a single job in CS anymore. it’s over. this field won’t exist in 1.5 years.

Anthony F: This is the kind of thought that will make the software engineers valuable in 1.5 years.

null: That’s what I’m hoping.

Robin Hanson: I would bet against this.

If anything, being in software should make you worry less.

Pavel Asparouhov: Non technical folk saying the SWEs are cooked — it’s you guys who are cooked.

Ur gonna have ex swes competing with everything you’re doing now, and they’re gonna be AI turbocharged

Engineers were simply doing coding bc it was the highest leverage use of mental power

When that shifts it’s not going to all of the sudden shift the hierarchy

They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.

And they will absolutely be competing with you.

If you don’t join them, you are probably going to lose.

Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can. Notice he gives advice anyway:

Roon: Nobody should give or receive any career advice right now. Everyone is broadly underestimating the scope and scale of change and the high variance of the future. Your L4 engineer buddy at Meta telling you “bro, CS degrees are cooked” doesn’t know anything.

Greatness cannot be planned.

Stay nimble and have fun.

It’s an exciting time. Existing status hierarchies will collapse, and the creatives will win big.

Roon: guy with zero executive function to speak of “greatness cannot be planned”

Simon Sarris: I feel like I’m going insane because giving advice to new devs is not that hard.

  1. Build things you like preferably publicly with your real name
  2. Have a website that shows something neat
  3. Help other people publicly. Participate in social media socially.

Do you notice how “AI” changes none of this?

Wailing about because of some indeterminate future and claiming that there’s no advice that can be given to noobs are both breathlessly silly. Think about what you’re being asked for at least ten seconds. You can really think of nothing to offer? Nothing?

Master of Your Domain

Ajeya Cotra: I wonder if an o3 agent could productively work on projects with poor feedback loops (eg “research X topic”) for many subjective years without going off the rails or hitting a degenerate loop. Even if it’s much less cost-efficient now it would quickly become cheaper.

Another situation where onlookers/forecasters probably disagree a lot about *today’s* capabilities let alone future capabilities.

Wonder how o3 would do on wedding planning.

Note the date on that poll, it is prior to o3.

I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.

The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.

If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.

So, good question, actually. As far as we know, no one has dared try.

Safety Third

The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!

This, when they tested the original GPT-4, under far less dangerous circumstances, for months before release. Whereas with o3, by the time testing starts it could possibly already be too late.

Take Eliezer Yudkowsky’s warning here both seriously and literally:

Greg Brockman: o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

Eliezer Yudkowsky: Sir, this level of capabilities needs to be continuously safety-tested while you are training it on computers connected to the Internet (and to humans). You are past the point where it seems safe to train first and conduct evals only before user releases.

RichG (QTing EY above): I’ve been avoiding politics and avoiding tribe like things like putting ⏹ in my name, but level of lack of paranoia that these labs have is just plain worrying. I think I will put ⏹ in my name now.

Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.

Given both kinds of model uncertainty, I don’t think you had three.

If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.

You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.

That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.

Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.

I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.

Davidad: “Maybe the AI capabilities researchers aren’t very smart” is a very very hazardous assumption on which to pin one’s AI safety hopes

I don’t mean to imply it’s *pointless* to keep AI capabilities ideas private. But in my experience, if I have an idea, at least somebody in one top lab will have the same idea by next quarter, and someone in academia or open source will have the idea and publish within 1-2 years.

A better hope [is to solve the practical safety problems, e.g. via interpretability.]

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.

I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.

Certainly now is not the time to keep our safety concerns or ideas to ourselves.

The Safety Testing Program

Thus, you are invited to their early access safety testing.

OpenAI: We’re inviting safety researchers to apply for early access to our next frontier models. This early access program complements our existing frontier model testing process, which includes rigorous internal safety testing, external red teaming such as our Red Teaming Network and collaborations with third-party testing organizations, as well as the U.S. AI Safety Institute and the UK AI Safety Institute.

As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research.

As part of 12 Days of OpenAI⁠, we’re opening an application process for safety researchers to explore and surface the potential safety and security implications of the next frontier models.

Safety testing in the reasoning era

Models are becoming more capable quickly, which means that new threat modeling, evaluation, and testing techniques are needed. We invest heavily in these efforts as a company, such as designing new measurement techniques under our Preparedness Framework, and are focused on areas where advanced reasoning models, like our o-series, may pose heightened risks. We believe that the world will benefit from more research relating to threat modeling, security analysis, safety evaluations, capability elicitation, and more.

Early access is flexible for safety researchers. You can explore things like:

  • Developing Robust Evaluations: Build evaluations to assess previously identified capabilities or potential new ones with significant security or safety implications. We encourage researchers to explore ideas that highlight threat models that identify specific capabilities, behaviors, and propensities that may pose concrete risks tied to the evaluations they submit.
  • Creating Potential High-Risk Capabilities Demonstrations: Develop controlled demonstrations showcasing how reasoning models’ advanced capabilities could cause significant harm to individuals or public security absent further mitigation. We encourage researchers to focus on scenarios that are not possible with currently widely adopted models or tools.

Examples of evaluations and demonstrations for frontier AI systems:

We hope these insights will surface valuable findings and contribute to the frontier of safety research more broadly. This is not a replacement for our formal safety testing or red teaming processes.

How to apply

Submit your application for our early access period, opening December 20, 2024, to push the boundaries of safety research. We’ll begin selections as soon as possible thereafter. Applications close on January 10, 2025.

Sam Altman: if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon.

extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.

(and most of all, excited to see what people will build with this!)

If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.

This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.

One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.

I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.

Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.

What Could Possibly Go Wrong?

Pretty much all the things, at this point, although the worst ones aren’t likely… yet.

GFodor.id: It’s hard to take anyone seriously who can see a PhD in a box and *not* imagine clearly more than a few plausible mass casualty events due to the evaporation of friction due to lack of know-how and general IQ.

In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:

  1. Misuse.
  2. Automated R&D and potential takeoffs or self-improvement.
  3. For-real loss of control problems that aren’t #2.

For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].

In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.

Aaron Bergman: Maybe this is obvious but: the other labs seem to be broadly following a pretty normal cluster of commercial and scientific incentives. o3 looks like the clearest example yet of OpenAI being ideologically driven by AGI per se.

Like you don’t design a system that costs thousands of dollars to use per API call if you’re focused on consumer utility – you do that if you want to make a machine that can think well, full stop.

Peter Wildeford: I think OpenAI genuinely cares about getting society to grapple with AI progress.

I don’t think ideological is the right term. You don’t make it for direct consumer use if your focus is on consumer utility. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.

It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.

If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.

And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.

What Could Possibly Go Right?

I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.

Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.

Send in the Skeptic

Guess who.

All right, that’s my fault, I made that way too easy.

Gary Marcus: After almost two years of declaring that a release of GPT-5 is imminent and not getting it, super fans have decided that a demo of system that they did zero personal experimentation with — and that won’t (in full form) be available for months — is a mic-drop AGI moment.

Standards have fallen.

[o1] is not a general purpose reasoner. it works where there is a lot of augmented data etc.

First off, it is Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two-year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to say nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.

Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed great then you combine them.

Srini Pagdyala: If O3 is AGI, why are they spending billions on GPT-5?

Gary Marcus: Damn good question!

So no, not a good question.

Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.

Is o3 an actual AGI? No. I’m pretty sure it is not.

But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.

This is Almost Certainly Not AGI

I mention this because some were suggesting it might be.

Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’

And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.

And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

It is not.

If it was, OpenAI would be communicating about this very differently.

If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.

That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.

How long that lasts? That’s another question.

What would be the most Robin Hanson take here, in response to the ARC score?

Robin Hanson: It’s great to find things AI can’t yet do, and then measure progress in terms of getting AIs to do them. But crazy wrong to declare we’ve achieved AGI when we reach human level on the latest such metric. We’ve seen dozens of such metrics so far, and may see dozens more before AGI.

o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.

This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.

Does This Mean the Future is Open Models?

Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.

I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.

Dan Hendrycks: If gains in AI reasoning will mainly come from creating synthetic reasoning data to train on, then the basis of competitiveness is not having the largest training cluster, but having the most inference compute.

This shift gives Microsoft, Google, and Amazon a large advantage.

Inference compute being the true cost also means that model quality and efficiency potentially matters quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.
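A minimal sketch of the log-scale arithmetic here, with made-up constants: if benchmark score grows roughly linearly in the log of inference compute, a fixed quality gap turns into a multiplicative compute bill.

```python
# Toy illustration of "everything is on a log scale": if score grows roughly
# linearly in log10(compute), a fixed score gap costs a multiplicative amount
# of compute to close. The slope and gap below are made up for illustration.

SLOPE = 5.0        # assumed: benchmark points gained per 10x more inference compute
score_gap = 10.0   # assumed: how far the weaker model trails at equal compute

extra_orders_of_magnitude = score_gap / SLOPE
compute_multiplier = 10 ** extra_orders_of_magnitude

print(f"Closing a {score_gap}-point gap needs ~{compute_multiplier:.0f}x more compute")
# With these made-up numbers: a 10-point gap costs ~100x more inference compute.
```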

That leaves a hell of a lot of room for profit margins.

Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.

It also matters whether you actually do the thing you need to do.

Tanishq Abraham: Today, people are saying Google is cooked rofl

Gallabytes: Not me, though. Big parallel thinking just got de-risked at scale. They’ll catch up.

If recursive self-improvement is the game, OpenAI will win. If industrial scaling is the game, it’ll be Google. If unit economics are the game, then everyone will win.

Pushinpronto: Why does OpenAI have an advantage in the case of recursive self-improvement? Is it just the fact that they were first?

Gallabytes: We’re not even quite there yet! But they’ll bet hard on it much faster than Google will, and they have a head start in getting there.

What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.

The thinking here by Evan at the link about the implications of takeoff seem deeply confused – if we’re in a takeoff situation then that changes everything and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!

As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.
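To spell out the arithmetic in Christiano’s definition: doubling world output over four years implies roughly 19% annual growth, versus roughly 3% today, so the ‘slow’ scenario is already absurdly fast by historical standards. A minimal sketch:

```python
# Implied annual growth rates from Christiano's "slow takeoff" definition:
# world output doubles over a 4-year interval before it ever doubles in 1 year.

def annual_growth_for_doubling(years: float) -> float:
    """Constant annual growth rate that doubles output over the given span."""
    return 2 ** (1 / years) - 1

print(f"Double in 4 years: {annual_growth_for_doubling(4):.1%} per year")  # ~18.9%
print(f"Double in 1 year:  {annual_growth_for_doubling(1):.1%} per year")  # 100%
# For comparison, recent world GDP growth has been on the order of ~3% per year.
```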

Not Priced In

One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in’ but no of course it isn’t, we didn’t know, stop pretending we knew, insiders in OpenAI couldn’t have bought enough Nvidia here.

Also, on Monday after a few days to think, Nvidia overperformed the Nasdaq by ~3%.

And this was how the Wall Street Journal described that, even then:

 

No, I didn’t buy more on Friday, I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is: either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit. Nick is correct in his attitude here: you do not need to do the research, because you know this isn’t priced in, but in expectation you can assume that everything you are not thinking about is priced in.

And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.

Our Media is Failing Us

Spencer Schiff (on Friday): In a sane world the front pages of all mainstream news websites would be filled with o3 headlines right now

The traditional media, instead, did not notice it. At all.

And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Christmas?

They did successfully hype it among AI Twitter, also known as ‘the future.’

Bindu Reddy: The o3 announcement was a MASTERSTROKE by OpenAI

The buzz about it is so deafening that everything before it has been wiped out from our collective memory!

All we can think of is this mythical model that can solve insanely hard problems 😂

Nick: the whole thing is so thielian.

If you’re going to take on a giant market doing probably illegal stuff call yourself something as light and bouba as possible, like airbnb, lyft

If you’re going to announce agi do it during a light and happy 12 days of christmas short demo.

Sam Altman (replying to Nick!): friday before the holidays news dump.

Well, then.

In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.

But back in the world where everyone else lives…

 

Benjamin Todd: Most places I checked didn’t mention AI at all, or they’d only have a secondary story about something else like AI and copyright. My twitter is a bubble and most people have no idea what’s happening.

OpenAI: we’ve created a new AI architecture that can provide expert level answers in science, math and coding, which could herald the intelligence explosion.

The media: bond funds!

Davidad: As Matt Levine used to say, People Are Worried About Bond Market Liquidity.

Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.

And we have this variation of that article, in the print edition, on Saturday, after o3:

[image]

Sam Altman: I think The Wall Street Journal is the overall best U.S. newspaper right now, but they published an article called “The Next Great Leap in AI Is Behind Schedule and Crazy Expensive” many hours after we announced o3?

It wasn’t only WSJ either; there’s also Bloomberg, which normally I love:

On Monday I did find coverage of o3 in Bloomberg, but it not only wasn’t on the front page, it wasn’t even on the front tech page; I had to click through to the AI section.

Another fun one, from Thursday; here’s the original in the NY Times:

[image]

Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:

Not everyone in the A.I. world is concerned. Some, like OpenAI’s chief executive, Sam Altman, say that progress will continue at the same pace, albeit with some twists on old techniques.

That post also mentions both synthetic data and o1.

OpenAI recently released a new system called OpenAI o1 that was built this way. But the method only works in areas like math and computer programming, where there is a firm distinction between right and wrong.

It works best there, yes, but that doesn’t mean it’s the only place it works.

We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’

True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.

Not Covered Here: Deliberative Alignment

I realized I wanted to say more here and have this section available as its own post. So more on this later.

The Lighter Side

Oh no!

Oh no!

Mikael Brockman: o3 is going to be able to create incredibly complex solutions that are incorrect in unprecedentedly confusing ways.

We made everything astoundingly complicated, thus solving the problem once and for all.

Humans will be needed to look at the output of AGI and say, “What the f*** is this? Delete it.”

Oh no!

Comments (15)

The way o1’s performance falls off much faster than o3’s as ARC-AGI problems get larger is significant evidence that o3 is built on a different base model than o1, one with better long-context training or different handling of attention in the model architecture. So probably a post-trained Orion/GPT-4.5o.

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.

I continue to think that by this definition takeoff will be fast, not slow. I think recent progress is making my prediction here look more and more likely, because it is making timelines seem shorter and shorter. (Once we get superintelligence, world output will be doubling in a year or less, I think. So slow takeoff by Paul's definition is what happens when powerful pre-ASI systems are nevertheless widely deployed in the economy for long enough to double it. If, like me, you think that ASI is just a few years away, then there isn't much time left for pre-ASI systems to explode into the economy and double it.)

I think on a strict interpretation of Christiano's definition, we're almost right on the bubble. Suppose we were to take something like Nvidia's market cap as a very loose proxy for overall growth in accumulated global wealth due to AI. If it keeps doubling or tripling annually, but the return on that capital stays around 4-5% (aka if we assume the market prices things mostly correctly), then there would be a 4 year doubling just before the first 1 year doubling. But if it quadruples or faster annually, there won't be. Note: my math is probably wrong here, and the metric I'm suggesting is definitely wrong, but I don't think that affects my thinking on this in principle. I'm sure with some more work I could figure out the exact rate at which the exponent of an exponential can linearly increase while still technically complying with Christiano's definition.
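
As a toy version of that check (my own sketch, not the commenter’s Nvidia-market-cap proxy): assume world output’s annual growth rate compounds by a constant factor each year, and ask whether a complete 4-year doubling finishes by the time the first 1-year doubling interval begins. Every parameter below is an illustrative assumption, and “before” is given one specific reading.

```python
def takeoff_is_slow(initial_growth: float, acceleration: float, years: int = 50) -> bool:
    """True if a complete 4-year doubling of output finishes no later than
    the start of the first 1-year doubling interval (one reading of
    Christiano's criterion). Purely a toy model."""
    output = [1.0]
    growth = initial_growth          # annual growth rate, e.g. 0.03 for 3%
    for _ in range(years):
        output.append(output[-1] * (1 + growth))
        growth *= acceleration       # growth rate compounds by this factor each year

    # First year t with output(t+1) >= 2 * output(t), i.e. a 1-year doubling.
    first_1yr = next((t for t in range(len(output) - 1)
                      if output[t + 1] / output[t] >= 2), None)
    # Earliest end year t of a 4-year window with output(t) >= 2 * output(t-4).
    first_4yr_end = next((t for t in range(4, len(output))
                          if output[t] / output[t - 4] >= 2), None)

    if first_1yr is None:
        return True                  # no 1-year doubling within the horizon
    if first_4yr_end is None:
        return False
    return first_4yr_end <= first_1yr


if __name__ == "__main__":
    # Growth rate rises ~30% per year: a 4-year doubling completes first (slow).
    print(takeoff_is_slow(initial_growth=0.03, acceleration=1.3))   # True
    # Growth rate quintuples per year from a low base: output doubles within a
    # single year before any 4-year window has doubled (fast).
    print(takeoff_is_slow(initial_growth=0.01, acceleration=5.0))   # False
```

The boundary cases are sensitive to the yearly granularity and to how strictly one reads “before,” which is roughly the commenter’s point about needing more careful math to pin down the exact threshold.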

But really, I don't think this kind of hair-splitting tracks the underlying differences that distinguish fast and slow takeoff as their definers originally intended. I suspect that if we do have a classically-defined "fast" takeoff, it will never be reflected in GDP or asset-market price statistics at all, because those metrics will be obsolete before the data is collected.

Note: I don't actually have a strong opinion or clear preference on whether takeoff will remain smooth or become sharp in this sense.

Right. We should probably introduce a new name, something like narrow AGI, to denote a system which is AGI-level in coding and math.

This kind of system will be "AGI" as redefined by Tom Davidson in https://www.lesswrong.com/posts/Nsmabb9fhpLuLdtLE/takeoff-speeds-presentation-at-anthropic:

“AGI” (=AI that could fully automate AI R&D)

This is what matters for AI R&D speed and for almost all recursive self-improvement.

Zvi is not quite correct when he says

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

o3 is not that good at coding and math (e.g. it only gets 71.7% on SWE-bench Verified), so it is not a "narrow AGI" yet. But it is strong enough; it's a giant step forward.

For example, if one takes Sakana's "AI scientist", upgrades it slightly, and uses o3 as the back-end, it is likely that one could generate NeurIPS/ICLR-quality papers, as many of them as one wants.

So, another upgrade (or a couple of upgrades) beyond o3, and we will reach that coveted "narrow AGI" stage.

What OpenAI has demonstrated is that it is much easier to achieve "narrow AGI" than "full AGI". This does suggest a road to ASI without going through anything remotely close to a "full AGI" stage, with missing capabilities to be filled afterwards.

If you’re in software engineering, pivot to software architecting.

fwiw, architecting feels to me easier than coding (I like doing both). I have some guesses on why it doesn't feel like this to most people (architecting is imo somewhat taught wrong, somewhat a gated topic, has less feedback in real life), but I don't think this will stand up to AIs for long and I would even work on building an agent that is good at architecture myself if I thought it would have a positive impact.

If o3/o4 aren't "spontaneously" good at architecture, then I expect it's because OpenAI didn't figure out (or try to figure out) how to train on relevant data; not many people write down their thoughts as they're planning a new architecture. What data will they use, system design interviews? But to be fair, this is a similar pushback to "there's not much good data on how to plan the code of a computer game," and yet AIs can still somehow output a working computer game line by line with no scratchpad.

As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns that will replicate themselves around your new system. All of which are things that current models are already very good at, verging on superhuman.

On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).

This is not a new thought, but I continue to find myself deeply confused by how so many people think the world does not contain enough data to train a mind to be highly smart and capable in at least most fields of human knowledge. Or, at least enough to train it to the point where reaching competence requires only a modest amount of hands-on practice and deliberate feedback. Like, I know the scaling pattern is going to be different for AI vs humans, but at a deep level, how do these people think schools and careers work?

I think the issue is that the form of the data matters a lot. ChatGPT basically just works with symbols as far as I know. How many symbols are there for it to eat? Are there enough to give the same depth of understanding that a human gets from processing spatial info for instance?

I don't think this is a hard limit for AI in general. But I can totally understand why there's skepticism about there being enough digital data right now to produce ASI on the current approach. I'd expect the AIs would need to develop different kinds of data collection, like using robots as sense organs for instance.

How many symbols are there for it to eat? Are there enough to give the same depth of understanding that a human gets from processing spatial info for instance?

Yes. It's not the case that humans blind from birth are dramatically less intelligent; learning from sound and touch is sufficient. LLMs are much less data efficient with respect to external data, because they learn only from external data. For a human mind, most of the data it learns from is probably to a large extent self-generated, synthetic, so having access to much less external data is not a big issue. For LLMs, there aren't yet general ways of generating synthetic data that can outright compensate for scarcity of external data and improve their general intelligence the way natural text data does, instead of propping up particular narrow capabilities (and hoping for generalization).

It seems to me that there are arguments to be made in both directions.[1] It's not clear to me just yet which stance is correct. Maybe yours is! I don't know.

My point is that it's understandable for intelligent people to suspect that there isn't enough data available yet to produce ASI on the current approach. You might disagree, and maybe your disagreement is even correct, but I don't think the situation is so vividly clear that it's incomprehensible why many people wouldn't be persuaded.

  1. ^

    As a quick gesture at the point: as far as I know, all the data LLMs are processing has already gone through a processing filter, namely humans. We produced all the tokens they took in as training data. A newborn, even blind, doesn't have this limitation — and I'd expect a newborn that was given this limitation somehow very much could have stunted intelligence! I think the analog would be less like a blind newborn and more like a numb one, without tactile or proprioceptive senses.

For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue.

Could you say more about this? What do you think is the ratio of external to internal data?

These are really good points! I think I was kind of assuming that video, photo, and audio data in some sense should be adequate for spatial processing and some other bottlenecks, but maybe not, and there are probably others I have no idea about.

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes, often I get ideas that seem like they might never get discovered otherwise.

My model is that "people don't do things" is the bigger bottleneck on capabilities progress than "no-one's thought of that yet".

I'm sure there is a person in each AGI lab who has had, at some point, an idea for capability-improvement isomorphic to almost any idea an alignment researcher had (perhaps with some exceptions). But the real blockers are, "Was this person one of the people deciding the direction of company research?", and, "If yes, do they believe in this idea enough to choose to allocate some of the limited research budget to it?".

And the research budget appears very limited. o1 seems to be incredibly simple, so simple all the core ideas were floating around back in 2022. Yet it took perhaps a year (until the Q* rumors in November 2023) to build a proof-of-concept prototype, and two years to ship it. Making even something as straightforward-seeming as that was overwhelmingly fiddly. (Arguably it was also delayed by OpenAI researchers having to star in a cyberpunk soap opera, except what was everyone else doing?)

So making a bad call regarding what bright idea to pursue is highly costly, and there are only so many ideas you can pursue in parallel. This goes tenfold for any ideas that might only work at sufficiently big scale – imagine messing up a GPT-5-level training run because you decided to try out something daring.

But: this still does not mean you can freely share capability insights. Yes, "did an AI capability researcher somewhere ever hear of this idea?" doesn't matter as much as you'd think. What does matter is, "is this idea being discussed widely enough to be fresh on the leading capability researchers' minds?". If yes, then:

  1. They may be convinced by one of the justifications regarding why this is a good idea.
  2. This idea may make it to the top of a leading researcher's mind, such that they would be idly musing on it 24/7 until finding a variant of it/an implementation of it that they'd be willing to try.
  3. If the idea is the talk of the town, they may not face as much reputational damage if they order R&D departments to focus on it and then it fails. (A smaller factor, but likely still in play.)

So I think avoiding discussion of potential capability insights is still a good policy.

Edit: I.e., don't give capability insights steam.

When they tested the original GPT-4, under far less dangerous circumstances, for months.

My impression is that it's the product-relevant post-training effort for GPT-4 that took months; the fact that there was also safety testing in the meantime is incidental rather than the cause of it taking months. This claim gets repeated, but I'm not aware of a reason to attribute the gap between the August 2022 end of pretraining (if I recall the rumors, or possibly claims by developers, correctly) and the March 2023 release to safety testing rather than to getting post-training right (in ways that are not specifically about safety).

Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.

Should this say Christmas?