Ah, great point. Regarding the comment you link to:
Yes, some reward hacking is going on, but at least in Claude (which I work with) it's a rare occurrence in daily practice, and it usually follows repeated attempts to actually solve the problem.
I believe that both DeepSeek R1-Zero and Grok's thinking mode were RL-trained solely on math and code, yet their reasoning seems to generalise somewhat to other domains as well.
So, while you’re absolutely right that we can’t do RL directly on the most important outcomes (research progress), I believe there will be significant transfer from what we can do RL on currently.
Would be curious to hear your sense of generalisation from the current narrow RL approaches!
Why specifically would you expect that RL on coding wouldn't sufficiently advance the coding abilities of LLMs to significantly accelerate the search for a better learning algorithm or architecture?
If current levels are around GPT-4.5, the compute increase over GPT-4 would be either 10× or 50×, depending on whether we interpolate the half-generation step on a log or a linear scale (assuming roughly 100× compute per full GPT generation).
The completion of Stargate would then push OpenAI’s compute to around GPT-5.5 levels. However, since other compute expansions (e.g., Azure scaling) are also ongoing, they may reach this level sooner.
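To make the arithmetic explicit, here's a minimal sketch of the two interpolation assumptions; the ~100× compute gap per full GPT generation is my own illustrative assumption, not a confirmed figure:

```python
# Compute multipliers implied by fractional GPT version numbers under
# two interpolation assumptions. The 100x gap per full generation is
# an assumption for illustration.

FULL_GENERATION_MULTIPLIER = 100  # assumed compute ratio, GPT-(n+1) vs GPT-n

def log_interpolated(fraction: float) -> float:
    # Half a generation covers half the orders of magnitude: 100**0.5 = 10x.
    return FULL_GENERATION_MULTIPLIER ** fraction

def linearly_interpolated(fraction: float) -> float:
    # Half a generation covers half the raw multiplier: 0.5 * 100 = 50x.
    return fraction * FULL_GENERATION_MULTIPLIER

print(log_interpolated(0.5))       # 10.0 -> GPT-4.5 at ~10x GPT-4 compute
print(linearly_interpolated(0.5))  # 50.0 -> GPT-4.5 at ~50x GPT-4 compute
```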
Recent discussions have suggested that better base models, rather than major changes to the RL setup itself, are the key enabler of current RL approaches. If so, moving the base model from GPT-4o scale to GPT-5.5 scale could produce a strong jump in capabilities.
It's unclear how much difference it makes to train the new base model (GPT-5) on reasoning traces from o3/o4 before applying RL. However, by the time the GPT-5-scale run begins, there will likely be a large corpus of filtered, high-quality reasoning traces, further edited for clarity, that can be incorporated into pretraining.
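For concreteness, here is a hypothetical sketch of what such a filter-and-edit step could look like; every name in it is an illustrative placeholder, not a known pipeline:

```python
# Hypothetical trace-filtering pipeline: keep only reasoning traces
# that reach a verified-correct answer, lightly edit them, and collect
# the result for mixing into the pretraining corpus.

from dataclasses import dataclass

@dataclass
class Trace:
    problem: str
    reasoning: str
    answer: str

def is_correct(trace: Trace, reference_answer: str) -> bool:
    # Placeholder verifier: in practice an exact-match check, unit
    # tests, or a grader model.
    return trace.answer.strip() == reference_answer.strip()

def edit_for_clarity(reasoning: str) -> str:
    # Placeholder rewriting pass, e.g. an LLM prompted to cut dead
    # ends and tighten prose while preserving the logic.
    return reasoning

def build_pretraining_shard(traces: list[Trace], references: list[str]) -> list[str]:
    return [
        edit_for_clarity(t.reasoning)
        for t, ref in zip(traces, references)
        if is_correct(t, ref)
    ]
```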
The change to a better base model for RL might enable longer-horizon agentic work as an emergent capability; combined with superhuman coding skills, this might already be quite unsafe.
GPT-5's reasoning abilities may be significantly more domain-specific than those of prior models.