This is a linkpost for https://limit-of-rlvr.github.io/

Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:

Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?

By evaluating models via pass@k, where success requires just one correct solution among k attempts, we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256). This demonstrates that RLVR narrows the model's exploration, favoring known high-reward paths instead of discovering new reasoning strategies. Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.
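
For concreteness, here is a minimal sketch of the standard unbiased pass@k estimator (the convention popularized by the Codex paper; whether this paper uses exactly this estimator is an assumption, but the success criterion, at least one correct solution among k samples, is the one described above):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 8 correct generations out of 256 sampled.
print(pass_at_k(n=256, c=8, k=1))   # 0.03125 -- weak pass@1
print(pass_at_k(n=256, c=8, k=64))  # ~0.9    -- but strong pass@64
```

This is why a base model with rare but nonzero coverage of a problem can look weak at pass@1 yet approach certainty at large k.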

Comments:

But success for most things doesn't require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or the first few tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?

Reducing the breadth of search is a substantial downside if it's a large effect. But reliably getting the right answer, instead of following weird paths most of which are wrong, seems like the essence of good reasoning.
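
A toy illustration of how that tradeoff produces the pass@k crossover (purely hypothetical numbers, not from the paper): an RL-tuned model that is sharp on problems it already handles but blind elsewhere, versus a base model with weak but broader coverage.

```python
def avg_pass_at_k(per_problem_success, k):
    """Benchmark-average probability of at least one success in k independent samples."""
    return sum(1 - (1 - p) ** k for p in per_problem_success) / len(per_problem_success)

rl_model   = [0.80] * 60 + [0.0] * 40  # reliable on 60% of problems, blind on the rest
base_model = [0.05] * 90 + [0.0] * 10  # weak everywhere it has coverage, but covers 90%

for k in (1, 8, 64, 256):
    print(k, round(avg_pass_at_k(rl_model, k), 3), round(avg_pass_at_k(base_model, k), 3))
# k=1:   RL ~0.48 vs. base ~0.045 -- RL wins on first-try reliability
# k=256: RL ~0.60 vs. base ~0.90  -- base overtakes once coverage dominates
```

In this toy setup the crossover comes entirely from coverage: sharpening the distribution buys first-try reliability at the cost of the long tail of problems the base model could still stumble onto.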

Yes, it matters for current model performance, but it means that RLVR isn't actually improving the model in a way that can be used for an iterated distillation & amplification loop, because it doesn't actually do real amplification. If this turns out right, it's quite bearish for AI timelines.

Edit: Ah, someone just alerted me to the crucial consideration that this was tested using smaller models (Qwen-2.5 7B/14B/32B and LLaMA-3.1-8B), which are significantly smaller than the models where RLVR has shown the most dramatic improvements (like DeepSeek-V3 → R1 or GPT-4o → o1). Given that different researchers have claimed there's a threshold effect, this substantially weakens these findings. But they say they're currently evaluating DeepSeek-V3 & R1, so I guess we'll see.

More thoughts:

I thought that AlphaZero was a counterpoint, but apparently it's significantly different. For example, it used true self-play allowing it to discover fully novel strategies.

Then again, I don't think more sophisticated reasoning is the bottleneck to AGI (compared to executive function & tool use), so even if reasoning doesn't really improve for a few years, we could still get AGI.

However, I previously thought reasoning models could be leveraged to figure out how to achieve actions, and then the best actions would be distilled into a better agent model, you know, IDA-style. But this paper makes me more skeptical of that working, because those agentic steps might require novel skills that aren't in the training data.

This suggests the shape of a possible ceiling on capability with R1-like training methods; see previous discussion here. The training is very useful, but it might only be ~pass@400 useful, rather than effectively without limit like AlphaZero. Since base models are not yet reliable at crucial capabilities even at ~pass@400, neither would their RL-trained variants become reliable.

This is plausibly a good way of measuring how well an RL training method works for LLMs, and another thing to hill-climb on. The question is how easy it will be to extend this ceiling once you are aware of it; the paper tries a few things that fail utterly (multiple RL training methods, different numbers of training steps; see Figure 7), which weakly suggests it may be difficult to get multiple orders of magnitude further quickly.