Reinforcement learning here is an innovation at train time, not test time; this was not clear to me from your article. Very little changes at test time: the model is simply allowed to keep outputting text and to decide for itself when to terminate, which 4o does not do.
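To make that concrete, here is a minimal sketch of the test-time loop as I understand it, with `next_token` standing in for one forward pass plus a sampling step (all names here are hypothetical placeholders, not any real API). The only test-time differences are a larger token budget and the model choosing when to emit its stop token:

```python
def generate(next_token, prompt_ids, stop_id, max_new_tokens=4096):
    """Sample until the model emits its stop token or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)   # hypothetical: one forward pass + sampling step
        ids.append(tok)
        if tok == stop_id:      # the model itself decides when to terminate
            break
    return ids
```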
What evidence is there that a model's own labels can benefit its own training? Or that an ORM (outcome reward model) or PRM (process reward model) can improve an LLM? That is the big question this article leaves unaddressed.
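For readers unfamiliar with the acronyms, roughly the distinction they imply is that an ORM scores only the final answer while a PRM scores every intermediate reasoning step. The `score_*` callables below are hypothetical stand-ins for learned reward models; this is an illustrative sketch, not anyone's actual implementation:

```python
def orm_score(steps, score_answer):
    """Outcome reward model (ORM): one scalar judging only the final answer."""
    return score_answer(steps[-1])

def prm_scores(steps, score_step):
    """Process reward model (PRM): one scalar per intermediate step."""
    return [score_step(s) for s in steps]
```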