Some critique of reasoning models like o1 (by OpenAI) and r1 (by DeepSeek).

OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains. Whether or not they generalize beyond their RL training is a trillion-dollar question. Right off the bat, I’ll tell you my take:

o1-style reasoners do not meaningfully generalize beyond their training.

 

A straightforward way to check how reasoners perform on domains without easy verification is to look at benchmarks. On math/coding, OpenAI's o1 models do exceptionally well. On everything else, the picture is less clear.

Results that jump out:

  1. o1-preview does worse on personal writing than gpt-4o and no better on editing text, despite costing 6× more per token (see the rough cost sketch after this list).
  2. OpenAI didn't release these scores for o1-mini, which suggests it may do worse than o1-preview. o1-mini also costs more than gpt-4o.
  3. On eqbench (which tests emotional understanding), o1-preview performs as well as gemma-27b.
  4. On eqbench, o1-mini performs as well as gpt-3.5-turbo. No, you didn’t misread that: it performs as well as gpt-3.5-turbo.
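
To put the price gap in perspective, here is a rough back-of-the-envelope sketch of the arithmetic. The per-million-token prices are roughly the published list prices around the time o1-preview launched (about $2.50/$10 input/output for gpt-4o and $15/$60 for o1-preview), the token counts are made-up illustrative figures rather than measurements, and the hidden-reasoning-token estimate is an assumption. o1 bills those hidden reasoning tokens as output, which is why the effective per-request gap ends up larger than the headline 6×.

```python
# Back-of-the-envelope API cost comparison (illustrative numbers only).
# Prices are dollars per million tokens (input, output) and may be out of date.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "o1-preview": (15.00, 60.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request, ignoring caching discounts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# A typical editing task: ~1,500 tokens in, ~800 visible tokens out.
# o1-preview also generates hidden reasoning tokens billed as output;
# assume ~3,000 of them here (an assumption, not a measured figure).
print(request_cost("gpt-4o", 1_500, 800))              # ≈ $0.0118
print(request_cost("o1-preview", 1_500, 800 + 3_000))  # ≈ $0.2505
```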

 

Throughout this essay, I’ve doomsayed o1-like reasoners because they’re locked into domains with easy verification. You won't see inference-time performance scale in a domain where you can’t gather a near-unlimited supply of automatically gradable practice problems for o1 to train on.
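
To make “easy verification” concrete, here is a minimal sketch (hypothetical helpers, not anything from OpenAI’s actual pipeline) of the kind of reward check that RL on reasoning traces depends on: a math answer can be graded by a few lines of code, while there is no comparably cheap grader for personal writing.

```python
# Sketch of a verifiable-reward check of the kind RL on reasoners relies on.
# For math, grading is a numeric comparison; for prose, it isn't.

def verify_math_answer(model_answer: str, reference: str) -> float:
    """Reward 1.0 if the final numeric answer matches the reference, else 0.0."""
    try:
        return float(float(model_answer.strip()) == float(reference.strip()))
    except ValueError:
        return 0.0

def verify_personal_essay(model_answer: str) -> float:
    """No cheap, reliable program scores warmth, taste, or honesty.
    You either pay humans or trust another model's judgment; both are
    noisy and expensive, so there is no near-unlimited stream of graded
    practice problems to run RL against."""
    raise NotImplementedError("no easy verifier for open-ended writing")

# The math check scales to millions of practice problems for free.
print(verify_math_answer("42", "42.0"))   # 1.0
print(verify_math_answer("41", "42.0"))   # 0.0
```

That asymmetry is the bottleneck: the first function hands you an effectively unlimited stream of graded practice, the second gives you nothing to optimize against without humans or another model in the loop.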

...

I expect transformative AI to come remarkably soon. I hope labs iron out the wrinkles in scaling model size. But if we do end up scaling model size to address these challenges, what was the point of inference compute scaling again?

Remember, inference scaling endows today’s models with tomorrow’s capabilities. It lets you skip the wait. If you want faster AI progress, you want inference compute to be a 1:1 replacement for training compute.

o1 is not the inference-time compute unlock we deserve.

If the entire AI industry moves toward reasoners, our future might be more boring than I thought.
