What's your bet on the next frontier models (Orion, Gemini 2, Llama-4) vs o1 in coding, math and logical reasoning benchmarks?
Will they have:
- Better performance
- Similar performance
- Worse performance
Curious to hear your answers...
For OpenAI, the question is whether the increase in model size and training on synthetic data will be enough to beat the teacher model, without test-time compute.
In the comments there are some clarifications about what I mean by "next-frontier" models.
I agree it’s not very clear. The focus of my question is on reasoning benchmarks—specifically mathematics, coding, and logical reasoning—while disregarding aspects like agency. By "next frontier" models, I’d only consider entries like Orion, Claude 3.5 Opus (or Claude 4 Opus, depending on its eventual naming), Llama 4 (big), and Gemini 2. A good way to identify them would be price per million tokens: for example, the new Sonnet is much less expensive than both o1 and Opus, so it doesn't count as a next-frontier model. Of course, the increasingly confusing naming conventions these companies adopt make it harder to define and categorize these "frontier models" clearly. I am editing the post to make it clearer. Thanks a lot for the feedback!