GPT-4o has both a new tokenizer and direct training on audio (whereas my understanding is that GPT-4 was trained only on text and images). Is there precedent for upgrading a model to a new tokenizer? It seems better to think of it as an entirely new model. If that's the case, what actually makes it a GPT-4?

ryan_greenblatt

OpenAI has historically used "GPT-N" to mean "a model as capable as GPT-N", where each GPT generation has around 100-200x the effective compute of the prior one.

This applies even if the model was trained considerably later than the original GPT-N and is correspondingly more efficient due to algorithmic improvements.

See, for instance, GPT-3.5-turbo, a model whose capability is somewhere between GPT-3 and GPT-4. There have been multiple releases of GPT-3.5-turbo that reduced the cost of inference; probably at least some of these releases correspond to a different model that is similarly capable.

The same applies to GPT-4-turbo, which is probably also a different model from the original GPT-4.

So, GPT-4o is probably also a different model from GPT-4, but one targeting a similar capability level (perhaps aiming for something like GPT-4.1 or GPT-4.25 to ensure that it is at least a little bit better than GPT-4).
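As a rough illustration of what fractional names like "GPT-4.1" or "GPT-4.25" would imply under this convention, here is a minimal sketch. It assumes that effective compute scales geometrically between generations (log-linear interpolation), which is an assumption for illustration and not something OpenAI has stated. The function name `effective_compute_multiplier` is hypothetical.

```python
def effective_compute_multiplier(version_gap: float, per_generation: float) -> float:
    """Effective-compute ratio implied by a (possibly fractional) gap in GPT-N numbers.

    Assumption (not from OpenAI): compute scales geometrically between
    generations, so a gap of g generations is a factor of per_generation ** g.
    """
    return per_generation ** version_gap


# Compare the low (100x) and high (200x) ends of the per-generation estimate.
for gap in (0.1, 0.25, 0.5, 1.0):
    low = effective_compute_multiplier(gap, 100.0)
    high = effective_compute_multiplier(gap, 200.0)
    print(f"GPT-N+{gap:.2f}: about {low:.1f}x to {high:.1f}x effective compute")
```

Under these assumptions, "GPT-4.25" would correspond to roughly 3x to 4x the effective compute of GPT-4, and "GPT-4.1" to well under 2x.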

(It's unclear if OpenAI will continue their GPT-N naming convention going forward or if they will stick to 100-200x effective compute per GPT.)

This differs from the naming scheme currently in use by Anthropic and Google. Anthropic and Google have generation names (Claude 3, Gemini 1/1.5), and within each generation they have names corresponding to different capability levels (Opus/Sonnet/Haiku, Ultra/Pro/Nano/Flash).

I agree. OpenAI claimed in the GPT-4o blog post that it is an entirely new model trained from the ground up. GPT-N refers to capabilities, not a specific architecture or set of weights. I imagine GPT-5 will likely be a scaled-up version of GPT-4o, as the success of GPT-4o has shown that multimodal training can reach similar capabilities with what is likely a smaller number of weights (judging by the fact that GPT-4o is cheaper and faster than GPT-4 and GPT-4-turbo).