"Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time?"
Current models were already good at identifying and fixing factual errors when run over a response and asked to critique and fix it. This works maybe 80% of the time for identifying whether there's a mistake, and the fix succeeds at a somewhat lower rate.
So not surprising at all that a reasoning loop can do the same thing. Possibly there's some other secret sauce in there, but just critiquing and fixing mistakes is probably enough to see the reported gains in o1.
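To make the arithmetic concrete, here is a rough sketch of what a critique-and-fix pass would do to an error rate. The 80% detection rate is the estimate above. The 20% starting error rate and the 70% fix rate are illustrative assumptions, as is the simplification that passes are independent and never introduce new mistakes.

```python
def residual_error_rate(base_error, detect_rate, fix_rate, passes=1):
    """Share of responses still wrong after `passes` critique-and-fix rounds.

    Simplifying assumptions (not measured): each round is independent,
    and a round never turns a correct answer into a wrong one.
    """
    remaining = base_error
    for _ in range(passes):
        # An error survives a round unless it is both detected and fixed.
        remaining *= 1 - detect_rate * fix_rate
    return remaining


# 0.20 and 0.70 are made-up inputs; 0.80 is the rough estimate from the comment.
one_pass = residual_error_rate(base_error=0.20, detect_rate=0.80, fix_rate=0.70)
two_passes = residual_error_rate(0.20, 0.80, 0.70, passes=2)
print(f"one pass: {one_pass:.3f}, two passes: {two_passes:.3f}")
```

Under those assumptions a single pass cuts errors by a bit more than half, which is at least the right order of magnitude for a noticeable jump in fact recall.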
So far, the answer seems to be that it transfers some, and o1 and o1-pro still seem highly useful in ways beyond reasoning, but o1-style models mostly don’t ‘do their core thing’ in areas where they couldn’t be trained on definitive answers.
Based on:
It seems likely to me that thinking skills transfer pretty well. But then this is trained out because it results in answers that raters don't like. So the model memorizes the answers it's supposed to go with.
So, how about OpenAI’s o1 and o1 Pro?
As a result, the universe realized its mistake, and cancelled the tsunami.
We now have o1, and for those paying $200/month we have o1 pro.
It is early days, but we can say with confidence: They are good models, sir. Large improvements over o1-preview, especially in difficult or extensive coding questions, math, science, logic and fact recall. The benchmark jumps are big.
If you’re in the market for the use cases where it excels, this is a big deal, and also you should probably be paying the $200/month.
If you’re not into those use cases, maybe don’t pay the $200, but others are very much into those tasks and will use this to accelerate those tasks, so this is a big deal.
Safety Third
This post will be about o1’s capabilities only. Aside from this short summary, it skips covering the model card, the safety issues and questions about whether o1 ‘tried to escape’ or anything like that.
For now, I’ll note that:
Here is the system card if you want to look at that in the meantime.
Rule One
For practical use purposes, evals are negative selection, you need to try it out.
Turning Pro
OpenAI introduces ChatGPT Pro, a $200/month service offering unlimited access to all of their models, including a special o1 pro mode where it uses additional compute.
Yep, premium pricing options are awesome. More like this, please.
$20/month makes your decision easy. If you don’t subscribe to at least one paid service, you’re a fool. If you’re reading this, and you’re not paying for both Claude and ChatGPT at a minimum, you’re still probably making a mistake.
At $200/month for ChatGPT Pro, or $2,400/year, we are plausibly talking real money. That decision is a lot less obvious.
The extra compute helps. The question is how much?
You can mostly ignore all the evals and scores. It’s not about that. It’s about what kind of practical boost you get from unlimited o1 pro and o1 (and voice mode).
When o1 pro is hooked up to an IDE, a web browser or both, that will make a huge practical difference. Right now, it offers neither. It’s a big jump by all reports in deep reasoning and complex PhD-level or higher science and math problems. It solves especially tricky coding questions exceptionally well. But how often are these the modalities you want, and how much value is on the table?
Early poll results (where a full 17% of you said you’d already tried it!) had a majority say it mostly isn’t worth the price, with only a small fraction saying it provides enough value for the common folk who aren’t mainlining.
I think Altman is wrong? Or alternatively, he’s actually saying ‘we don’t expect you to pay $200/month, it would be a bad look if I told you to pay that, and the $20/month product is excellent either way,’ which is reasonable.
I would be very surprised if pro coders weren’t getting great value here. Even if you only solve a few tricky spots each month, that’s already huge.
For short term practical personal purposes, those are the key questions.
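One way to frame the pricing question is a break-even calculation. The $200 and $20 monthly prices come from above. The hourly value of your time is a placeholder you would need to supply yourself, and the function name is hypothetical.

```python
PRO_PRICE = 200.0    # $/month for ChatGPT Pro, from the post
BASE_PRICE = 20.0    # $/month for the standard paid tier, from the post


def breakeven_hours(hourly_value, price=PRO_PRICE, baseline=BASE_PRICE):
    """Hours per month Pro must save, over the $20 tier, to pay for itself."""
    if hourly_value <= 0:
        raise ValueError("hourly_value must be positive")
    return (price - baseline) / hourly_value


# $100/hour is an illustrative assumption, not a claim about anyone's rate.
print(breakeven_hours(100))   # hours of saved work needed per month
print(PRO_PRICE * 12)         # annual cost, matching the $2,400/year figure
```

At anything like a professional coder's rate, the bar is a couple of saved hours a month, which is why a few tricky spots solved is already enough.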
Benchmarks
Vellum verifies MMLU, Human Eval and MATH, with very good scores: 92.3% MMLU, 92.4% HumanEval, 94.8% MATH. And that’s all for o1, not o1 pro.
These are big jumps. We also have 83% on AIME 2024.
It’s cheating, in a sense, to compare o1 outputs to Sonnet or GPT-4o outputs, since it uses more compute. But in a more important sense, progress is progress.
Jason Li wrote the 2024 Putnam and fed the questions into o1 (not pro), and thinks it got at least half (60/120), which would place in the top ~2%. Dan Hendrycks offered to put them into o1 pro, and responses were less impressed, so there’s some mismatch somewhere. Dan suspects he used a worse prompt.
A middle-level-silly benchmark is to open the floor and see what people ask?
Tym Switzer: Budget response:
Groan, fine, I guess, I mean I don’t really know what I was expecting.
Twitter, the floor is yours. What have we got?
Here is o1 pro speculating about potential explanations for unexplained things.
Here is o1 pro searching for the alpha in public markets, sure, but easy question.
Here is o1 pro’s flat tax plan, good instruction following, except I have to dock it tons of points for proactively suggesting an asset tax, and for not analyzing how to avoid reducing net revenue even though that wasn’t requested.
Here is o1 pro explaining Thermodynamic Dissipative adaptation at a post-doc level.
And Claude, commenting on that explanation, which it overall found strong:
There’s a lot more, I recommend browsing the thread.
As usual, it seems like you want to play to its strengths, rather than asking generic questions. The good news is that o1’s strengths include fact recall, coding and math and science and logic.
Silly Benchmarks
I always find them fun, but do not forget that they are deeply silly.
This seems importantly incomplete, even when adjusting so ‘easy’ and ‘hard’ refer to what you would expect to be easy or hard for a computer of a given type, rather than what would be easy or hard for a human. That’s because a lot of what matters is how the computer gets the answer right or wrong. We are far too results oriented, here as everywhere, rather than looking at the steps and breaking down the methods.
Fun with self-referential math questions.
Still failing to notice 9.8 is more than 9.11, I see? Although here o1 pro passes.
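For reference, a small sketch of why this question is a trap. As decimal numbers 9.8 is larger, but read as version or section numbers 9.11 comes later. That the second reading is what trips models up is a common guess, not something established here.

```python
from decimal import Decimal


def greater_as_number(a: str, b: str) -> bool:
    """Compare the strings as ordinary decimal numbers."""
    return Decimal(a) > Decimal(b)


def greater_as_version(a: str, b: str) -> bool:
    """Compare the strings as dotted version numbers, part by part."""
    def parts(s: str) -> tuple:
        return tuple(int(p) for p in s.split("."))
    return parts(a) > parts(b)


print(greater_as_number("9.8", "9.11"))    # True: 9.80 > 9.11
print(greater_as_version("9.8", "9.11"))   # False: release 9.11 follows 9.8
```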
Ask it to solve physics?
The answer to physics is of course completely Obvious Nonsense but the question essentially asked for completely Obvious Nonsense, so… not bad?
Failing to remember that the Earth is a sphere, which is relevant when a plane flies far enough.
Gallabytes goes super deep on the all-important tic-tac-toe benchmark, for a while was impressed that he couldn’t beat it, then did anyway.
Actually not a bad benchmark. Diminishing returns, so act now.
Here is an especially silly question to focus on:
Reactions to o1
Reactions to o1 were almost universally positive. It’s a good model, sir.
The basics: It’s relatively fast, and seems to hallucinate less.
Note that on the ‘hallucination’ tests per se, o1 did not outperform o1-preview.
The ‘vibe shift’ here is presumably as compared to o1-preview, which I, like many others, concluded wasn’t worth using in most cases.
Tyler Cowen finds o1 to be an excellent economist, and hard to stump.
Amjad Masad complains the model is not ‘reasoning from first principles’ on controversial questions but rather defaulting to consensus and calling everything else a conspiracy theory. I am confused why he expected it to suddenly start Just Asking Questions, given how it is being trained, and given how reliable consensus is in such situations versus Just Asking Questions, by default?
I bet you could still get it to think with better prompting. I think a certain type of person (which definitely includes Masad) is very inclined to find this type of fault, but as John Schulman explains, you couldn’t do it directly any other way even if you wanted to:
So far, the answer seems to be that it transfers some, and o1 and o1-pro still seem highly useful in ways beyond reasoning, but o1-style models mostly don’t ‘do their core thing’ in areas where they couldn’t be trained on definitive answers.
Reactions to o1 Pro
Reactions to o1 Pro by professionals seem very, very positive, although it does not strictly dominate Claude Sonnet.
TPIronside notes that while Claude Sonnet produces cleaner code, o1 is better at avoiding subtle errors or working with more obscure libraries and code bases. So you’d use Sonnet for most queries, but when something is driving you crazy you would pull out o1 Pro.
The key is what William realizes. The part where something is driving you crazy, or you have to pay down tech debt, is exactly where you end up spending most of your time (in my model and experience). That’s the hard part. So this is huge.
Sully also notes offhand he thinks Gemini-1206 is quite good.
Kakachia777 does a comparison of o1 Pro to Claude 3.5 Sonnet, and prefers Sonnet for coding because its code is easier to maintain. They have o1 pro somewhat better at deeper reasoning and complex tasks, but not by as much as others are saying, and recommend o1 Pro only for those who do specialized PhD-level tasks.
That post also claims new Chinese o1-style models are coming that will be much improved. As always, we shall wait and see.
For that wheelhouse, many report o1 Pro is scary good. Here’s one comment on Kakachia’s post.
Danielle Fong is feeling the headpats, and generally seems positive.
And you can always count on him, but this one does hit a bit different:
Derya Unutmaz reports o1 Pro unlocked great new ideas for his cancer therapy project, and he’s super excited.
A relatively skeptical take on o1-pro that still seems pretty sweet even so?
Here’s one I didn’t expect.
Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time? But perhaps not. Seems like a clue?
Potentially related is that Steve Sokolowski reports it blows away other models at legal research, to the point of enabling pro se cases.
Let Your Coding Work Flow
The problem with using o1 for coding, in a nutshell.
McKay Wrigley (professional impressed person who is indeed also impressed with Gemini 1206, and is also a coder) is super impressed with o1, but will continue using Sonnet as well, because you often don’t want to have to step out of context.
This basic idea makes sense. If you don’t need to rely on lots of context and want to essentially one-shot the problem, you want to Go Big with o1-pro.
If you want to make small adjustments, or write cleaner code, you go with Sonnet.
However, if Sonnet is failing at something and you’re going crazy, you can ‘pull out the bazooka’ and use o1-pro again, despite the context shifting. And indeed, that’s where the bulk of the actual pain comes, in my experience.
Still, putting o1 straight into the IDE would be ten times better, and likely not only get me to definitely pay but also to code a lot more?
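The workflow described in this section boils down to a small decision rule. Here is a sketch of it, where the function name, the flags and the returned labels are all hypothetical shorthand rather than real API model identifiers.

```python
def pick_model(one_shot: bool, small_edit_or_cleanup: bool, sonnet_stuck: bool) -> str:
    """Rough routing heuristic paraphrased from this section."""
    if sonnet_stuck:
        # 'Pull out the bazooka', accepting the context switch.
        return "o1-pro"
    if small_edit_or_cleanup:
        # Small adjustments or cleaner code: stay in context with Sonnet.
        return "sonnet"
    if one_shot:
        # Little context needed and you want the whole problem solved at once.
        return "o1-pro"
    # Assumed default when nothing else applies: stay where your context is.
    return "sonnet"


print(pick_model(one_shot=True, small_edit_or_cleanup=False, sonnet_stuck=False))
print(pick_model(one_shot=False, small_edit_or_cleanup=True, sonnet_stuck=True))
```

The ordering matters: being stuck overrides everything else, since that is where most of the pain is.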
Some People Need Practical Advice
I buy that this probably works.
A prompt that predicts a superior result is likely a very good prompt. So if this works without causing o1 to think for longer, my presumption is then that it works because people who take all the time they need, or are told they can do so, produce better answers, so this steers it into a space with better answers.
He also advises using o1 to ask lots of questions while reading books.
To answer Tyler Cowen’s question, I mean, never, obviously. The revolution will not be televised, so almost everyone will miss it. People aren’t going to read books and stop to ask questions. That sounds like work and being curious and paying attention, and people don’t even read books when not doing any of those things.
People definitely aren’t going to start cracking open history books. I mean, c’mon.
The ‘ask LLMs lots of questions while reading’ tactic is of course correct. It was correct before using Claude Sonnet, and it’s highly plausible o1 makes it more correct now that you have a second option – I’m guessing you’ll want to mix up which one you use based on the question type. And no, you don’t have to jam the book in the context window – but you could, and in many cases you probably should. What, like it’s hard? If the book is too long, use Gemini-1206.
That said, I’ve spent all day reading and writing and used almost no queries. I ask questions most often when reading papers, then when reading some types of books, but I rarely read books and I’ve been triaging away the papers for now.
One should of course also be asking questions while writing, or reading blogs, or even reading Twitter, but mostly I end up not doing it.
Overall
It is early, but it seems clear that o1 and especially o1 pro are big jumps in capability for things in their wheelhouse. If you want what this kind of extended thinking can get you, including fact recall and relative lack of hallucinations, and especially large or tricky code, math or science problems, and likely most academic style questions, we took a big step up.
When this gets incorporated into IDEs, we should see a big step up in coding. It makes me excited to code again, the way Claude Sonnet 3.5 did (and does, although right now I don’t have the time).
Another key weakness is lack of web browsing. The combination of this plus browsing seems like it will be scary powerful. You’ll still want some combination of GPT-4o and Perplexity in your toolbox.
For other uses, it is too early to tell when you would want to use this over Sonnet 3.5.1. My instinct is that you’d still probably want to default to Sonnet for questions where it should be ‘smart enough’ to give you what you’re looking for, or of course just ask both of them all the time. Also there’s Gemini-1206, which I’m hearing a bunch of positive vibes about, so it might also be worth a look.