The Information: OpenAI shows 'Strawberry' to feds, races to launch it

Martín Soto

Two new articles from The Information with insider information on OpenAI's next models and moves.

They are paywalled, but here are the new bits of information:

Strawberry is more expensive and slower at inference time, but can solve complex problems on the first try without hallucinations. It seems to be an application or extension of process supervision.
Its main purpose is to produce synthetic data for Orion, their next big LLM
But now they are also pushing to get a distillation of Strawberry into ChatGPT as soon as this fall
They showed it to feds

Some excerpts about these:

Plus this summer, his team demonstrated the technology [Strawberry] to American national security officials, said a person with direct knowledge of those meetings, which haven't previously been reported.

One of the most important applications of Strawberry is to generate high-quality training data for Orion, OpenAI's next flagship large language model that's in development. The codename hasn't previously been reported.

Using Strawberry could help Orion reduce the number of hallucinations, or errors, it produces, researchers tell me. That's because AI models learn from their training data, so the more correct examples of complex reasoning they see, the better. But there's also a push within OpenAI to simplify and shrink Strawberry through a process called distillation, so it can be used in a chat-based product before Orion is released. This shouldn't come as a surprise, given the intensifying competition among the top AI developers. We're not sure what a Strawberry-based product might look like, but we can make an educated guess.
One obvious idea would be incorporating Strawberry's improved reasoning capabilities into ChatGPT. However, though these answers would likely be more accurate, they also might be slower.

Researchers have aimed to launch the new AI, code-named Strawberry (previously called Q*, pronounced Q Star), as part of a chatbot—possibly within ChatGPT—as soon as this fall, said two people who have been involved in the effort. Strawberry can solve math problems it hasn't seen before—something today's chatbots cannot reliably do—and also has been trained to solve problems involving programming. But it's not limited to answering technical questions.

When given additional time to "think," the Strawberry model can also answer customers' questions about more subjective topics, such as product marketing strategies. To demonstrate Strawberry's prowess with language-related tasks, OpenAI employees have shown their co-workers how Strawberry can, for example, solve New York Times Connections, a complex word puzzle.

But OpenAI’s prospects rest in part on the eventual launch of a new flagship LLM it is currently developing, code-named Orion.

It isn’t clear whether a chatbot version of Strawberry that can boost the performance of GPT-4 and ChatGPT will be good enough to launch this year. The chatbot version is a smaller, simplified version of the original Strawberry model, known as a distillation.

However, OpenAI is also using the bigger version of Strawberry to generate data for training Orion, said a person with knowledge of the situation. That kind of AI-generated data is known as "synthetic." It means that Strawberry could help OpenAI overcome limitations on obtaining enough high-quality data to train new models from real-world data such as text or images pulled from the internet.
In addition, Strawberry could aid upcoming OpenAI agents, this person said.

Using Strawberry to generate higher-quality training data could help OpenAI reduce the number of errors its models generate, otherwise known as hallucinations, said Alex Graveley, CEO of agent startup Minion AI and former chief architect of GitHub Copilot.
Imagine "a model without hallucinations, a model where you ask it a logic puzzle and it's right on the first try," Graveley said. The reason why the model is able to do that is because "there is less ambiguity in the training data, so it's guessing less."

“We feel like we have enough [data] for this next model,” Altman said at an event in May, likely referring to Orion. “We have done all sorts of experiments including generating synthetic data.”

Strawberry has its roots in research. It was started years ago by Ilya Sutskever, then OpenAI's chief scientist. He recently left to start a competing AI lab. Before he left, OpenAI researchers Jakub Pachocki and Szymon Sidor built on Sutskever's work by developing a new math-solving model, Q*, alarming some researchers focused on AI safety.
The breakthrough and safety conflicts at OpenAI came just before OpenAI board directors—led by Sutskever—fired Altman before quickly rehiring him.
Last year, in the leadup to Q*, OpenAI researchers developed a variation of a concept known as test-time computation, meant to boost LLMs' problem-solving abilities. The method gives them the opportunity to spend more time considering all parts of a command or question someone has asked the model to execute. At the time, Sutskever published a blog post related to this work.

Did they name it after the strawberry problem!?

Nope! They named her after me.

</joke>

It is probably just a silly arbitrary codename reference to something like Altman growing strawberries at his house, who knows; but I would doubt that it refers to the counting-letters problem specifically because (1) that is due to BPE tokenization, which has way simpler solutions like byte tokenization, and it's not at all obvious how any kind of 'planning' or self-play RL breakthrough would apply to solving spelling gotcha questions; (2) I think that exact variant of the gotcha showed up after the first reporting of 'Strawberry' last year; (3) the reporting about Strawberry implied it was all about math problems like GSM8k, nothing to do with spelling; and (4) there's plenty of other things that would make a lot more sense as a reference (for example, being a riff off LeCun's "cherry" - another small red fruit frequently put on top of dessert cakes).

Alright, well, it probably isn't a reference to that either, because OA's Noam Brown now says post-o1-release that it was an arbitrary codename which doesn't refer to anything: https://x.com/polynoamial/status/1834312400419652079

Believe it or not, the name Strawberry does not come from the “How many r’s are in strawberry” meme. We just chose a random word. As far as we know it was a complete coincidence.

(Nor, given what is described of o1 in the release, can I see any way in which it could be a reference to either strawberry problem. Although it does often solve the BPE-related letter-counting problem for 'strawberry' specifically, it doesn't do so perfectly, nor does it solve other BPE-related problems.)
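To make the tokenization point above concrete, here is a minimal Python sketch of why letter counting is trivial over characters or bytes but awkward over BPE-style tokens. The three-way token split and the toy vocabulary are hypothetical, chosen only for illustration; they are not the output of any real tokenizer, and this says nothing about how o1 or any OpenAI model actually handles the question.

```python
# Illustrative only: the token split below is hypothetical,
# not the output of any real tokenizer.
word = "strawberry"

# Character-level view: counting letters is a direct lookup.
char_count = word.count("r")

# BPE-style view: the model receives opaque token IDs, not letters,
# so the spelling inside each token is not directly visible to it.
hypothetical_tokens = ["str", "aw", "berry"]  # assumed split
toy_vocab = {tok: i for i, tok in enumerate(hypothetical_tokens)}
token_ids = [toy_vocab[tok] for tok in hypothetical_tokens]

# Byte-level view: one token per byte, so every letter is its own symbol.
byte_tokens = list(word.encode("utf-8"))
byte_count = byte_tokens.count(ord("r"))

print("characters:", char_count)
print("token ids seen by the model:", token_ids)
print("bytes:", byte_count)
```

With byte tokens the count falls out of the input representation itself, which is the sense in which byte tokenization is a "way simpler solution" than any planning or RL breakthrough.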

However, OpenAI is also using the bigger version of Strawberry to generate data for training Orion, said a person with knowledge of the situation. That kind of AI-generated data is known as "synthetic." It means that Strawberry could help OpenAI overcome limitations on obtaining enough high-quality data to train new models from real-world data such as text or images pulled from the internet.

Reminded me of these quotes/predictions from Epoch's Trading Off Compute in Training and Inference:

Meanwhile, these companies might be able to leverage additional inference compute to achieve better capabilities at a smaller scale, either for internal use or for a small number of external customers. Policy proposals which seek to control the advancement or proliferation of dangerous AI capabilities should take this possibility into account.

it seems likely that models deployed at scale will be closer to the low end of inference compute. Meanwhile, there will be substantially more capable versions of those models that use more inference compute and therefore won’t be available at scale.

AI progress might be faster than expected in some applications, at a limited scale due to higher inference costs. For example, AI companies might be able to use augmented models to speed up their own AI research.

This sounds genuinely worrying. Largest negative timeline update I've made in many months.

Why?

It was already known that the AGI labs were experimenting with synthetic data and that OpenAI is training GPT-5, and the article is light on new details:

It's not really true that modern AIs "can't reliably solve math problems they haven't seen before": this depends on the operationalization of "a math problem" and "seen before". All this statement says is "Strawberry is better at math than the SOTA models", which in turn means "nonzero AI progress".
Similar for hallucinations.
The one concrete example is solving New York Times Connections, but Claude 3.5 can already do it on a good day.

I mean, the state of affairs is certainly worrying, but I don't really see what in this article would prompt a meaningful update?

I also felt like this was mostly priced in, but I think a maybe more useful prompt for people who feel like they made an update: I think this is a good time to ask "How could I have thought that faster?", and think about what updates you maybe still haven't fully propagated.

Agreed, always a good exercise to do when surprised.

We knew they were experimenting with synthetic data. We didn't know they were succeeding.

The big answer, now that we know o1 was made using Q*/Strawberry, is essentially that Strawberry/Q* did two very important things:

It cracked the code on how to make a General Purpose Search that scales with more compute, and in particular the model can now adaptively think for longer on harder problems.

In essence, OpenAI figured out how to implement General Purpose Search scalably:

https://www.lesswrong.com/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see

It unlocked a new inference scaling law, which in particular means that more compute can reliably solve more problems at inference.

This makes AI capabilities harder to contain, since it's easier to have large inference runs than large training runs.
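As a rough way to see why extra inference compute can buy more solved problems, here is a toy Python sketch. It assumes each attempt is an independent sample with a fixed success probability `p`, and that a perfect verifier can pick out a correct answer whenever one exists. Both are simplifying assumptions of mine, and the function name `solve_rate` is made up for this example; this is not a description of how Strawberry or o1 works.

```python
def solve_rate(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct.

    Assumes a fixed per-attempt success probability p and a perfect
    verifier that recognizes a correct answer when it appears.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# More samples (more inference compute) raise the solve rate,
# with diminishing returns as it approaches 1.
for n in (1, 4, 16, 64):
    print(n, round(solve_rate(0.05, n), 3))
```

Under these assumptions the solve rate rises monotonically with the number of samples, which is the toy version of "more compute can reliably solve more problems at inference". Real systems have correlated samples and imperfect verifiers, so the actual curve would be flatter.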

I will note that I don't think we've seen this approach work any wonders yet.

(...well unless this is what's up with Sonnet 3.5 being that much better than before 🤷‍♂️)

Should we be worried about the alignment of Strawberry itself?
If it is misaligned and is providing training data for their next-generation model, then it can poison the well, even if Strawberry itself is nowhere near TAI.

Please tell me that they have considered this...
Or that I am wrong and it's not a valid concern.
