But success for most things doesn't require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or the first few tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?
Reducing the breadth of search is a substantial downside if it's a large effect. But reliably getting the right answer, instead of following weird paths most of which are wrong, seems like the essence of good reasoning.
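For concreteness, here's a minimal sketch of the unbiased pass@k estimator (the combinatorial one popularized by the Codex paper, which I'm assuming is essentially what's used here): sample n completions per problem, count the c correct ones, and compute the chance that at least one of k completions drawn without replacement is correct. It shows how "right at least once in k tries" can look very different from "right on the first try".

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct.

    Probability that at least one of k completions, drawn without
    replacement from the n samples, is correct:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:  # fewer than k incorrect samples -> at least one correct is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 200 samples, 8 correct.
print(pass_at_k(200, 8, 1))    # ≈ 0.04  (rarely right on the first try)
print(pass_at_k(200, 8, 100))  # ≈ 0.997 (almost always right somewhere in 100 tries)
```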
Yes, it matters for current model performance, but it means that RLVR isn't actually improving the model in a way that can feed an iterated distillation & amplification loop, because it doesn't actually do real amplification. If this turns out to be right, it's quite bearish for AI timelines.
Edit: Ah, someone just alerted me to the crucial consideration that this was tested on smaller models (Qwen-2.5 7B/14B/32B and LLaMA-3.1-8B), which are significantly smaller than the models where RLVR has shown the most dramatic improvements (like DeepSeek-V3 → R1 or GPT-4o → o1). Given that various researchers have claimed there's a threshold effect, this substantially weakens the findings. But they say they're currently evaluating DeepSeek-V3 & R1, so I guess we'll see.
More thoughts:
I thought that AlphaZero was a counterpoint, but apparently it's significantly different: for example, it used true self-play, allowing it to discover fully novel strategies.
Then again, I don't think more sophisticated reasoning is the bottleneck to AGI (compared to executive function & tool use), so even if reasoning doesn't really improve for a few years, we could still get AGI.
However, I previously thought reasoning models could be leveraged to figure out how to perform difficult actions, and then the best actions would be distilled into a better agent model, IDA-style. But this paper makes me more skeptical of that working, because those agentic steps might require novel skills that aren't in the training data.
It points at the shape of a possible ceiling on capability with R1-like training methods (see previous discussion here). The training is very useful, but it might only be ~pass@400 useful rather than improving ~without limit like AlphaZero. And since base models are not yet reliable at crucial capabilities even at ~pass@400, their RL-trained variants wouldn't become reliable either.
It's plausibly a good way of measuring how well an RL training method works for LLMs, another thing to hill-climb on. The question is how easy it will be to extend this ceiling when you are aware of it, and the paper tries a few things that fail utterly (multiple RL training methods, different numbers of training steps, see Figure 7), which weakly suggests it might be difficult to get multiple orders of magnitude further quickly.
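To make the crossover picture concrete, here's a toy sketch (entirely made-up sample counts, not the paper's data) of the comparison as I understand it: estimate pass@k for a base model and its RL-trained variant across a range of k, and see where the base model's broader coverage overtakes the RL model's per-sample reliability. That crossover k is the "~pass@400"-style ceiling being discussed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, given per-problem correct counts
    out of n samples each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical per-problem correct counts out of n = 512 samples.
# The RL model is far more reliable per sample on the problems it can
# solve, but solves a narrower set; the base model solves more problems
# at least occasionally.
n = 512
base_counts = [3, 1, 0, 2, 5, 1, 0, 1, 4, 2]
rl_counts   = [60, 30, 0, 50, 90, 0, 0, 0, 80, 40]

for k in (1, 8, 64, 256, 512):
    b = mean_pass_at_k(base_counts, n, k)
    r = mean_pass_at_k(rl_counts, n, k)
    print(f"k={k:4d}  base pass@k={b:.2f}  RL pass@k={r:.2f}")
# At small k the RL model wins; at large enough k the base model's
# broader coverage catches up and overtakes it. In this toy setup the
# crossover happens somewhere between k=256 and k=512.
```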