Could we scale to true world understanding? LLMs possess extensive knowledge, yet we rarely see them produce cross-domain insights the way a person deeply versed in, say, physics and medicine might. Why doesn't their breadth of knowledge translate into a comparable depth of insight? I suspect a lack of robust conceptual world models is behind this shortfall, and that outcome-based RL during post-training might be the missing element for achieving deeper understanding and more effective world models.
Let’s start with a pretrained LLM that was only ...
Might Claude 3.7's performance above the predicted line be explained by it being specifically trained for long-horizon tasks? I know Claude 3.7 acts more like a coding agent than a standard LLM, which makes me suspect it had considerable RL on larger tasks than, say, next-word prediction or solving AIME math questions. If that is the case, we can expect the slope to increase as RL is scaled toward long-horizon tasks.
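To make "above the predicted line" concrete, here is a minimal sketch of the kind of trend-fit I have in mind: fit a log-linear trend of task horizon against release date and check whether a given model's point sits above that fit. All model names and numbers below are hypothetical placeholders, not real benchmark data.

```python
import numpy as np

# Hypothetical (release date as year fraction, task horizon in minutes) points,
# loosely in the spirit of long-horizon-task scaling plots. Placeholder values only.
models = {
    "model_A": (2023.2, 8.0),
    "model_B": (2023.9, 18.0),
    "model_C": (2024.5, 35.0),
    "model_D": (2025.1, 70.0),
}

x = np.array([v[0] for v in models.values()])
# log2 of the horizon, so a fixed doubling rate shows up as a straight line.
y = np.log2(np.array([v[1] for v in models.values()]))

# Fit log2(horizon) = slope * year + intercept.
slope, intercept = np.polyfit(x, y, 1)
print(f"implied doubling time ≈ {12 / slope:.1f} months")

# A model is "above the predicted line" if its observed horizon exceeds
# what the trend predicts for its release date.
claude_37 = (2025.15, 120.0)  # hypothetical placement, not a measured value
predicted = 2 ** (slope * claude_37[0] + intercept)
print(f"predicted horizon: {predicted:.0f} min, observed (hypothetical): {claude_37[1]:.0f} min")
print("above trend" if claude_37[1] > predicted else "on/below trend")
```

If models trained heavily with RL on long-horizon tasks keep landing above the fitted line, that is exactly the slope increase I am suggesting.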