The way o1's performance falls off much faster than o3's as ARC-AGI problems get larger is significant evidence that o3 is built on a different base model than o1, with better long-context training or different handling of attention in the model architecture. So probably post-trained Orion/GPT-4.5o.
Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.
I continue to think that by this definition takeoff will be fast, not slow. I think recent progress is making my prediction here look more and more likely, because it is making timelines seem shorter and shorter. (Once we get superintelligence, world output will be doubling in a year or less, I think. So slow takeoff by Paul's definition is what happens when powerful pre-ASI systems are nevertheless widely deployed in the economy for long enough to double it. If, like me, you think that ASI is just a few years away, then there isn't much time left for pre-ASI systems to explode into the economy and double it.)
I think on a strict interpretation of Christiano's definition, we're almost right on the bubble. Suppose we were to take something like Nvidia's market cap as a very loose proxy for overall growth in accumulated global wealth due to AI. If it keeps doubling or tripling annually, but the return on that capital stays around 4-5% (aka if we assume the market prices things mostly correctly), then there would be a 4 year doubling just before the first 1 year doubling. But if it quadruples or faster annually, there won't be. Note: my math is probably wrong here, and the metric I'm suggesting is definitely wrong, but I don't think that affects my thinking on this in principle. I'm sure with some more work I could figure out the exact rate at which the exponent of an exponential can linearly increase while still technically complying with Christiano's definition.
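The bookkeeping here can be made concrete with a minimal Python sketch. It rests on one reading of Christiano's definition (the first 4-year doubling window must end no later than the first 1-year doubling window begins). The function names, the monthly step size, and the linearly rising growth rate are illustrative assumptions, not anything taken from Christiano or from the Nvidia proxy above.

```python
import math
from typing import List, Optional, Sequence, Tuple


def first_doubling_window(output: Sequence[float], window: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the earliest span of `window` steps
    over which output at least doubles, or None if there is no such span."""
    for end in range(window, len(output)):
        start = end - window
        if output[end] >= 2 * output[start]:
            return (start, end)
    return None


def is_slow_takeoff(output: Sequence[float], steps_per_year: int = 12) -> Optional[bool]:
    """True if a full 4-year doubling completes before the first 1-year
    doubling starts, False if it does not, None if no 1-year doubling
    happens within the series at all."""
    four_year = first_doubling_window(output, 4 * steps_per_year)
    one_year = first_doubling_window(output, steps_per_year)
    if one_year is None:
        return None
    if four_year is None:
        return False
    return four_year[1] <= one_year[0]


def simulate_output(initial_rate: float, rate_increase_per_year: float,
                    years: int, steps_per_year: int = 12) -> List[float]:
    """Hypothetical output path whose annual growth rate rises linearly."""
    output = [1.0]
    for step in range(years * steps_per_year):
        rate = initial_rate + rate_increase_per_year * step / steps_per_year
        output.append(output[-1] * math.exp(rate / steps_per_year))
    return output


# Growth rate creeping up by 2 points a year: slow by this reading.
print(is_slow_takeoff(simulate_output(0.03, 0.02, 60)))  # True
# Growth rate jumping by 100 points a year: fast by this reading.
print(is_slow_takeoff(simulate_output(0.03, 1.0, 10)))   # False
```

Under these assumptions the answer depends only on how quickly the growth rate itself climbs, which is the quantity the paragraph above is trying to pin down.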
But really, I don't think this kind of hair-splitting tracks the underlying differences that distinguish fast and slow takeoff as their definers originally intended. I suspect if we do have a classically-defined "fast" takeoff it'll never be reflected in GDP or asset market price statistics at all, because the metric will be obsolete before the data is collected.
Note: I don't actually have a strong opinion or clear preference on whether takeoff will remain smooth or become sharp in this sense.
Right. We should probably introduce a new name, something like narrow AGI, to denote a system which is AGI-level in coding and math.
This kind of system will be "AGI" as redefined by Tom Davidson in https://www.lesswrong.com/posts/Nsmabb9fhpLuLdtLE/takeoff-speeds-presentation-at-anthropic:
“AGI” (=AI that could fully automate AI R&D)
This is what matters for AI R&D speed and for almost all recursive self-improvement.
Zvi is not quite correct when he says:
If o3 was as good on most tasks as it is at coding or math, then it would be AGI.
o3 is not that good in coding and math (e.g. it only gets 71.7% on SWE-bench verified), it is not a "narrow AGI" yet. But it is strong enough, it's a giant step forward.
For example, if one takes Sakana's "AI scientist", upgrades it slightly, and uses o3 as a back-end, it is likely that one can generate NeurIPS/ICLR quality papers and as many of those as one wants.
So, another upgrade (or a couple of upgrades) beyond o3, and we will reach that coveted "narrow AGI" stage.
What OpenAI has demonstrated is that it is much easier to achieve "narrow AGI" than "full AGI". This does suggest a road to ASI without going through anything remotely close to a "full AGI" stage, with missing capabilities to be filled afterwards.
If you’re in software engineering, pivot to software architecting.
fwiw, architecting feels to me easier than coding (I like doing both). I have some guesses on why it doesn't feel like this to most people (architecting is imo somewhat taught wrong, somewhat a gated topic, has less feedback in real life), but I don't think this will stand up to AIs for long and I would even work on building an agent that is good at architecture myself if I thought it would have a positive impact.
If o3/o4 aren't "spontaneously" good at architecture, then I expect it's because OpenAI didn't figure out (or try to figure out) how to train on relevant data; not many people write down their thoughts as they're planning a new architecture. What data will they use, system design interviews? But to be fair, this is a similar pushback to "there's not much good data on how to plan the code of a computer game", yet AIs can still somehow output a working computer game line by line with no scratchpad.
As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of which are things that current models are already very good, verging on superhuman, at.
On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).
This is not a new thought, but I continue to find myself deeply confused by how so many people think the world does not contain enough data to train a mind to be highly smart and capable in at least most fields of human knowledge. Or, at least enough to train it to the point where reaching competence requires only a modest amount of hands-on practice and deliberate feedback. Like, I know the scaling pattern is going to be different for AI vs humans, but at a deep level, how do these people think schools and careers work?
I think the issue is that the form of the data matters a lot. ChatGPT basically just works with symbols as far as I know. How many symbols are there for it to eat? Are there enough to give the same depth of understanding that a human gets from processing spatial info for instance?
I don't think this is a hard limit for AI in general. But I can totally understand why there's skepticism about there being enough digital data right now to produce ASI on the current approach. I'd expect the AIs would need to develop different kinds of data collection, like using robots as sense organs for instance.
How many symbols are there for it to eat? Are there enough to give the same depth of understanding that a human gets from processing spatial info for instance?
Yes. It's not the case that humans blind from birth are dramatically less intelligent, learning from sound and touch is sufficient. LLMs are much less data efficient with respect to external data, because they only learn external data. For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue. For LLMs, there aren't yet general ways of generating synthetic data that can outright compensate for scarcity of external data and improve their general intelligence the way natural text data does, instead of propping up particular narrow capabilities (and hoping for generalization).
It seems to me that there are arguments to be made in both directions.[1] It's not clear to me just yet which stance is correct. Maybe yours is! I don't know.
My point is that it's understandable for intelligent people to suspect that there isn't enough data available yet to produce ASI on the current approach. You might disagree, and maybe your disagreement is even correct, but I don't think the situation is so vividly clear that it's incomprehensible why many people wouldn't be persuaded.
As a quick gesture at the point: as far as I know, all the data LLMs are processing has already gone through a processing filter, namely humans. We produced all the tokens they took in as training data. A newborn, even blind, doesn't have this limitation — and I'd expect a newborn that was given this limitation somehow very much could have stunted intelligence! I think the analog would be less like a blind newborn and more like a numb one, without tactile or proprioceptive senses.
For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue.
Could you say more about this? What do you think is the ratio of external to internal data?
These are really good points! I think I was kind of assuming that video, photo, and audio data in some sense should be adequate for spatial processing and some other bottlenecks, but maybe not, and there are probably others I have no idea about.
I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise
My model is that "people don't do things" is the bigger bottleneck on capabilities progress than "no-one's thought of that yet".
I'm sure there is a person in each AGI lab who has had, at some point, an idea for capability-improvement isomorphic to almost any idea an alignment researcher had (perhaps with some exceptions). But the real blockers are, "Was this person one of the people deciding the direction of company research?", and, "If yes, do they believe in this idea enough to choose to allocate some of the limited research budget to it?".
And the research budget appears very limited. o1 seems to be incredibly simple, so simple all the core ideas were floating around back in 2022. Yet it took perhaps a year (until the Q* rumors in November 2023) to build a proof-of-concept prototype, and two years to ship it. Making even something as straightforward-seeming as that was overwhelmingly fiddly. (Arguably it was also delayed by OpenAI researchers having to star in a cyberpunk soap opera, except what was everyone else doing?)
So making a bad call regarding what bright idea to pursue is highly costly, and there are only so many ideas you can pursue in parallel. This goes tenfold for any ideas that might only work at sufficiently big scale – imagine messing up a GPT-5-level training run because you decided to try out something daring.
But: this still does not mean you can freely share capability insights. Yes, "did an AI capability researcher somewhere ever hear of this idea?" doesn't matter as much as you'd think. What does matter is, "is this idea being discussed widely enough to be fresh on the leading capability researchers' minds?". If yes, then:
So I think avoiding discussion of potential capability insights is still a good policy.
Edit: I.e., don't give capability insights steam.
When they tested the original GPT-4, under far less dangerous circumstances, for months.
My impression is that it's the product-relevant post-training effort for GPT-4 that took months, the fact that there was also safety testing in the meantime is incidental rather than the cause of it taking months. This claim gets repeated, but I'm not aware of a reason to attribute the gap between Aug 2022 end of pretraining (if I recall the rumors or possibly claims by developers correctly) and Mar 2023 release to safety testing rather than to getting post-training right (in ways that are not specifically about safety).
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
OpenAI presented o3 on the Friday before Christmas, at the tail end of the 12 Days of Shipmas.
I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?
They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):
Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.
o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.
First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.
This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.
Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.
o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.
With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?
Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]
…
The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!
Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.
Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.
GPQA Has Fallen
Codeforces Has Fallen
In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.
Arc Has Kinda Fallen But For Now Only Kinda
o3 shows dramatically improved performance on the ARC-AGI challenge.
Francois Chollet offers his thoughts, full version here.
Is there a catch?
There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent, o3 high is spending thousands of dollars.
It is worth noting that $0.10 per problem is a lot cheaper than human level.
Beyond that, is there another catch? That’s a matter of some debate.
Even with catches, the improvements are rather mind-blowing.
President of the Arc prize Greg Kamradt verified the result.
For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.
And this progress is quite a lot.
It is not, however, a direct harbinger of AGI, one does not want to overreact.
Here is Melanie Mitchell giving an overview that seems quite good.
Except, oh no!
They Trained on the Train Set
How dare they!
Or:
Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’
Previous SotA on ARC involved training not only on the train set, but on a much larger synthetic training set. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.
If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.
The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.
Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.
The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.
The other objection is that this required quite a lot of compute.
That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.
It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.
The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.
AIME Has Fallen
I remember when AIME problems were hard.
This one is not a surprise. It did definitely happen.
AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.
Frontier of Frontier Math Shifting Rapidly
Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:
At the time, no model solved more than 2% of these questions. And then there’s o3.
It is indeed rather crazy how many people only weeks ago thought this level of FrontierMath was a year or more away.
Therefore…
FrontierMath 4: We’re Going To Need a Bigger Benchmark
When the FrontierMath is about to no longer be beyond the frontier, find a new frontier. Fast.
Tier 5 could presumably be ‘ask a bunch of problems we have actually no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.
What is o3 Under the Hood?
From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.
Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.
I agree with that somewhat. I’m confused how far to go with it.
If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.
(Again, I don’t know if that’s what happened or not.)
Not So Fast!
And I don’t simply mean that the full o3 is not so fast, which it indeed is not:
I am waiting for economists to discover various things, Noam Brown excluded.
Scary fast? Absolutely.
However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.
They are only now starting o3 safety testing, and from the sound of it this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.
That is still, again, scary fast.
It is however also the low hanging fruit, on two counts.
Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?
Even if not, it will still get faster and cheaper. And that alone is huge.
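The ‘~100 day’ figure above can be checked with a few lines of Python. This is a sketch that assumes the year is 2024 (per the December 2024 reference in the announcement thread) and takes the o3 announcement date to be the Friday before Christmas, as stated at the top of the post. The helper name `friday_before` is hypothetical.

```python
from datetime import date, timedelta


def friday_before(day: date) -> date:
    """Return the latest Friday strictly before `day`."""
    # date.weekday() counts Monday as 0, so Friday is 4.
    days_back = (day.weekday() - 4) % 7 or 7
    return day - timedelta(days=days_back)


o1_preview_release = date(2024, 9, 12)
o3_announcement = friday_before(date(2024, 12, 25))
gap_days = (o3_announcement - o1_preview_release).days

print(o3_announcement.isoformat(), gap_days)  # 2024-12-20 99
```

That is announcement to announcement only. It does not model the difference in where each model stood in safety testing, which is what pushes the estimate above toward 4-5 months.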
Deep Thought
As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.
Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.
Our Price Cheap
Right now o3 is rather expensive, although o3-mini will be cheaper than o1.
That doesn’t mean o3-level outputs will stay expensive, although presumably once they are cheap, people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.
Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.
You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.
This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.
To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.
Still, yes, some people don’t love this.
Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.
And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.
And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.
The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.
If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.
I find Altman’s framing interesting here, and important:
Exponentially more money for marginally more performance.
Over time, massive cost reductions.
In a sense, the extra money is buying you living in the future.
Do you want to live in the future, before you get the cost reductions?
In some cases, very obviously yes, you do.
Has Software Engineering Fallen?
I would not say it has fallen. I do know it will transform.
If two years from now you are writing code line by line, you’ll be a dinosaur.
The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.
Don’t Quit Your Day Job
If you’re in AI capabilities, pivot to AI safety.
If you’re in software engineering, pivot to software architecting.
If you’re working purely for a living, pivot to building things and shipping them.
But otherwise, don’t quit your day job.
If anything, being in software should make you worry less.
They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.
And they will absolutely be competing with you.
If you don’t join them, you are probably going to lose.
Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can. Notice he gives advice anyway:
Master of Your Domain
Note the date on that poll, it is prior to o3.
I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.
The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.
If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.
So, good question, actually. As far as we know, no one has dared try.
Safety Third
The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!
When they tested the original GPT-4, under far less dangerous circumstances, for months. Whereas with o3, it could possibly have already been too late.
Take Eliezer Yudkowsky’s warning here both seriously and literally:
Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.
Given both kinds of model uncertainty, I don’t think you had three.
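For readers less used to the ‘nines’ shorthand, here is a small Python sketch of the arithmetic: n nines of safety means the chance of being wrong is at most 10^-n. The function name is hypothetical, and the inputs are only the thresholds mentioned above, not new estimates. Probabilities are passed as decimal strings so the rounding stays exact.

```python
from decimal import Decimal, ROUND_FLOOR


def nines_of_safety(p_safe: str) -> int:
    """Count whole 'nines' in a safety probability given as a decimal string."""
    p = Decimal(p_safe)
    if not (Decimal(0) <= p < Decimal(1)):
        raise ValueError("p_safe must be in [0, 1)")
    p_fail = Decimal(1) - p
    # n nines means p_fail <= 10**-n, so n = floor(-log10(p_fail)).
    return int((-p_fail.log10()).to_integral_value(rounding=ROUND_FLOOR))


for p in ("0.9", "0.99", "0.999"):
    print(p, nines_of_safety(p))  # 1, 2 and 3 nines respectively
```

Each extra nine cuts the residual risk by a factor of ten, which is why the gap between two nines and three matters when the downside is catastrophic.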
If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.
You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.
That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.
Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.
I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.
I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.
I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.
Certainly now is not the time to keep our safety concerns or ideas to ourselves.
The Safety Testing Program
Thus, you are invited to their early access safety testing.
If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.
This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.
One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.
I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.
Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.
What Could Possibly Go Wrong?
Pretty much all the things, at this point, although the worst ones aren’t likely… yet.
In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:
For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].
In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.
I don’t think ideological is the right term. You don’t make it for direct consumer use if your focus is on consumer utility. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.
It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.
If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.
And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.
What Could Possibly Go Right?
I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.
Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.
Send in the Skeptic
Guess who.
All right, that’s my fault, I made that way too easy.
First off, it is time for Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to say nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.
Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed, great, then you combine them.
So no, not a good question.
Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.
Is o3 an actual AGI? No. I’m pretty sure it is not.
But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.
This is Almost Certainly Not AGI
I mention this because some were suggesting it might be.
Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’
And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.
And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.
If o3 was as good on most tasks as it is at coding or math, then it would be AGI.
It is not.
If it was, OpenAI would be communicating about this very differently.
If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.
That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.
How long that lasts? That’s another question.
What would be the most Robin Hanson take here, in response to the ARC score?
o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.
This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.
Does This Mean the Future is Open Models?
Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.
I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.
Inference compute being the true cost also means that model quality and efficiency potentially matter quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.
That leaves a hell of a lot of room for profit margins.
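To make the log scale point concrete, here is a minimal sketch. It assumes, purely for illustration, that benchmark score is linear in the log of inference compute, so score = a + b * log10(compute). The function name and the numbers are hypothetical, not measurements of any real model.

```python
def compute_multiplier(score_gap: float, points_per_10x: float) -> float:
    """Extra compute factor needed to close `score_gap` points.

    Assumes score = a + b * log10(compute), where b = `points_per_10x`
    is how many points each 10x of compute buys (an illustrative assumption).
    """
    if points_per_10x <= 0:
        raise ValueError("points_per_10x must be positive")
    return 10 ** (score_gap / points_per_10x)


if __name__ == "__main__":
    # Hypothetical: each 10x of inference compute buys 5 points.
    for gap in (5, 10):
        factor = compute_multiplier(gap, points_per_10x=5)
        print(f"{gap} points behind -> {factor:.0f}x the compute")
```

Under that assumption, being one "10x worth" of points behind costs 10x the compute and being two behind costs 100x, which is where the room for margins comes from.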
Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.
It also matters whether you actually do the thing you need to do.
What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.
The thinking here by Evan at the link about the implications of takeoff seems deeply confused. If we’re in a takeoff situation then that changes everything, and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!
As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:
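The definition in question is Christiano’s, quoted earlier in this post: a complete 4 year interval in which world output doubles comes before the first 1 year interval in which world output doubles. Here is a minimal sketch of checking that against a series, assuming one output figure per year. The function name and the example series are hypothetical, made up for illustration.

```python
from typing import Optional, Sequence


def is_slow_takeoff(output: Sequence[float]) -> Optional[bool]:
    """Check Christiano's 'slow takeoff' condition on annual output data.

    output[t] is world output in year t. Returns None if the series
    contains no 1-year doubling yet, so the question is undetermined.
    """
    # Start year of the first 1-year interval in which output doubles.
    first_fast = next(
        (t for t in range(len(output) - 1) if output[t + 1] >= 2 * output[t]),
        None,
    )
    if first_fast is None:
        return None
    # Is there a 4-year doubling that finishes no later than that start year?
    return any(
        output[s + 4] >= 2 * output[s] for s in range(first_fast - 3)
    )


if __name__ == "__main__":
    # Hypothetical: five years of 20% growth, then a 1-year doubling.
    slow = [1.0, 1.2, 1.44, 1.728, 2.0736, 2.48832, 5.0]
    # Hypothetical: 3% growth, then an abrupt jump.
    fast = [1.0, 1.03, 1.06, 2.5]
    print(is_slow_takeoff(slow), is_slow_takeoff(fast))
```

Note that doubling in 4 years only takes about 19% annual growth, against a historical norm of a few percent, which is why ‘slow’ here is still remarkably fast.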
Not Priced In
One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in,’ but no, of course it isn’t. We didn’t know, stop pretending we knew, and insiders in OpenAI couldn’t have bought enough Nvidia here.
Also, on Monday after a few days to think, Nvidia overperformed the Nasdaq by ~3%.
And this was how the Wall Street Journal described that, even then:
No, I didn’t buy more on Friday, I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is: either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit. Nick is correct in his attitude here: you do not need to do the research, because you know this isn’t priced in, but in expectation you can assume that everything you are not thinking about is priced in.
And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.
Our Media is Failing Us
The traditional media, instead, did not notice it. At all.
And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Christmas?
They did successfully hype it among AI Twitter, also known as ‘the future.’
Well, then.
In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.
But back in the world where everyone else lives…
Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.
And we have this variation of that article, in the print edition, on Saturday, after o3:
It wasn’t only WSJ either, there’s also Bloomberg, which normally I love:
On Monday I did find coverage of o3 in Bloomberg, but not only was it not on the front page, it wasn’t even on the front tech page. I had to click through to AI.
Another fun one, from Thursday, here’s the original in the NY Times:
Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:
That post also mentions both synthetic data and o1.
It works best there, yes, but that doesn’t mean it’s the only place that works.
We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’
True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.
Not Covered Here: Deliberative Alignment
I realized I wanted to say more here and have this section available as its own post. So more on this later.
The Lighter Side
Oh no!
Oh no!
Oh no!