Reinforcement learning here is an innovation at train time, not test time; this was not clear to me from your article. Very little changes at test time: the model is simply allowed to keep outputting text and to decide for itself when to terminate, which 4o does not do.
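To make that concrete, here is a minimal sketch of the test-time loop as I understand it, with `next_token` standing in for one forward pass plus a sampling step (all names here are hypothetical placeholders, not any real API). The only test-time differences are a larger token budget and the model choosing when to emit its stop token:

```python
def generate(next_token, prompt_ids, stop_id, max_new_tokens=4096):
    """Sample until the model emits its stop token or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)   # hypothetical: one forward pass + sampling step
        ids.append(tok)
        if tok == stop_id:      # the model itself decides when to terminate
            break
    return ids
```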
What evidence is there that a model's own labels can benefit its own training? Or that an ORM (outcome reward model) or PRM (process reward model) can improve an LLM? That is the big question this article leaves unaddressed.
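For readers unfamiliar with the acronyms, roughly the distinction they imply is that an ORM scores only the final answer while a PRM scores every intermediate reasoning step. The `score_*` callables below are hypothetical stand-ins for learned reward models; this is an illustrative sketch, not anyone's actual implementation:

```python
def orm_score(steps, score_answer):
    """Outcome reward model (ORM): one scalar judging only the final answer."""
    return score_answer(steps[-1])

def prm_scores(steps, score_step):
    """Process reward model (PRM): one scalar per intermediate step."""
    return [score_step(s) for s in steps]
```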