GPT-4 level open weights models like Llama-3-405B don't seem capable of dangerous cognition. OpenAI o1 demonstrates that a GPT-4 level model can be post-trained into producing useful long horizon reasoning traces. AlphaZero shows how capabilities can be obtained from compute alone, with no additional data. If there is a way of bringing these together, the apparent helplessness of the current generation of open weights models might prove misleading.
Post-training is currently a combination of techniques that use synthetic data and human labeled data. Human labeled data significantly improves quality, but its collection is slow and scales poorly. Synthetic data is an increasingly useful part of post-training, and the automated parts of its generation scale easily. Unlike weaker models, GPT-4 level LLMs clearly pass reading comprehension on most occasions, and OpenAI o1 improves on this further. This suggests that at some point human data might become mostly unnecessary in post-training, even if it still helps a little. Without it, post-training becomes automated and gets to use more compute, while avoiding the need for costly and complicated human labeling.
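To make the "automated labeling" idea concrete, here is a minimal sketch of the loop where a judge model stands in for the human annotator. Everything in it (function names, the threshold, the scoring scheme) is an illustrative assumption, not anyone's actual pipeline; the point is only the structure: generate, judge, filter, finetune.

```python
# Minimal sketch of automated post-training data collection, with the human
# labeler replaced by a judge model. All model calls are hypothetical stubs.

import random
from typing import List, Tuple

def generate_candidates(model, prompt: str, n: int = 8) -> List[str]:
    """Sample n candidate responses from the model being post-trained (stub)."""
    return [f"response {i} to: {prompt}" for i in range(n)]

def judge_score(judge, prompt: str, response: str) -> float:
    """Ask a stronger chatbot to rate a response; stands in for a human label (stub)."""
    return random.random()

def collect_synthetic_data(model, judge, prompts: List[str],
                           threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Keep only (prompt, response) pairs the judge rates above a threshold."""
    dataset = []
    for prompt in prompts:
        for response in generate_candidates(model, prompt):
            if judge_score(judge, prompt, response) >= threshold:
                dataset.append((prompt, response))
    return dataset

if __name__ == "__main__":
    data = collect_synthetic_data(model=None, judge=None,
                                  prompts=["prove 2+2=4", "summarize this PR"])
    print(f"collected {len(data)} labeled examples without human annotators")
```

The only human input left in a loop like this is the choice of prompts and the judge's instructions, which is why the compute budget, not labeling throughput, becomes the binding constraint.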
A pretrained model at the next level of scale, such as Llama 4, if made available in open weights, might initially look approximately as tame as current models. OpenAI o1 demonstrates that useful post-training for long sequences of System 2 reasoning is possible. In the case of o1 in particular, this might involve a lot of human labeling, making its reproduction a very complicated process (at least if the relevant datasets are not released, and the reasoning traces themselves are not leaked in large quantities). But if some generally available chatbots at the next level of scale are good enough at automating labeling, this complication could be sidestepped, with o1 style post-training cheaply reproduced on top of a previously released open weights model.
So there is an overhang in an open weights model that's distributed without long horizon reasoning post-training: applying such post-training significantly improves its capabilities, so impressions formed from the model as released understate what it can be turned into. The problem right now is that a new level of pretraining scale is approaching in the coming months, while the ability to cheaply apply long horizon reasoning post-training might follow shortly thereafter, possibly unlocked by these very same models at the new level of pretraining scale (since it might currently be too expensive for most actors to implement, or to run enough experiments to figure out how). The resulting level of capabilities is currently unknown, and could well remain unknown outside the leading labs until after the enabling artifacts, the open weights pretrained models at the next level of scale, have already been published.
This won't necessarily be impactful at all. The crucial question for the next few years is whether and where LLM scaling plateaus. Before o1, GPT-4 level models couldn't produce useful reasoning traces of any great length. Reading comprehension only started mostly working at this scale. And RLHF through PPO is apparently to a large extent a game of carefully tuned early stopping, so it's brittle and doesn't generalize far off-distribution, which made it unclear whether heavier RL could help with System 2 reasoning in cases where it isn't capable of rebuilding the capabilities from scratch, overwriting the damage it does to the LLM.
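The "early stopping" point can be made concrete with the standard KL-penalized RLHF setup: the policy is optimized against a reward shaped by its divergence from the reference model, and training is typically halted once that divergence drifts too far, because beyond that point the reward model is being exploited rather than satisfied. The sketch below is illustrative only; the constants and names are assumptions, not a description of any particular training run.

```python
# Sketch of why RLHF-through-PPO depends on carefully tuned early stopping:
# the optimization target is r(x, y) - beta * KL(pi || pi_ref), and runs are
# halted once KL drift from the reference policy exceeds a budget. Numbers
# here are illustrative assumptions.

def shaped_reward(reward_model_score: float, kl_to_reference: float,
                  beta: float = 0.05) -> float:
    """KL-penalized RLHF objective for a single sample."""
    return reward_model_score - beta * kl_to_reference

def should_stop(kl_to_reference: float, max_kl: float = 10.0) -> bool:
    """Early stopping criterion: halt PPO once the policy drifts too far."""
    return kl_to_reference >= max_kl

if __name__ == "__main__":
    for step, kl in enumerate([0.5, 2.0, 6.0, 12.0]):
        print(step, shaped_reward(reward_model_score=1.0, kl_to_reference=kl))
        if should_stop(kl):
            print(f"stopping at step {step}: KL budget exhausted")
            break
```

A procedure whose success hinges on stopping before the policy drifts too far is exactly the kind of thing that tends not to generalize to much heavier optimization.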
But taking base model scaling one step further should help both with the fragility of GPT-4 level models and with their ability to carry out System 2 reasoning on their own, using a little adaptation that elicits capabilities, rather than heavier RL that instills capabilities not already present to a useful extent in the base model. And then there's at least one more step of base model scaling after that (deployed models are about 40 megawatts of H100s, models in training about 150 megawatts; in 1.5 years we'll get to about a gigawatt, with 2x in FLOPs from moving to B200s and whatever effective FLOP/joule Trillium delivers). So there's every chance this advancement is rendered obsolete, to the extent that it's currently observable as a capability improvement, if these scaled up models just start being competent at System 2 reasoning without needing such post-training. Even the ability of cheaper models to reason could then be reconstructed by training them on reasoning traces collected from larger models.
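As a back-of-envelope check on those power figures, the sketch below converts them into rough GPU counts and a relative compute multiplier. The all-in watts per H100 (GPU plus its share of host, networking, and cooling) and the 2x FLOP/joule gain for the next hardware generation are rough assumptions, not measured numbers.

```python
# Back-of-envelope version of the power figures in the paragraph above.

WATTS_PER_H100_ALL_IN = 1500  # assumption: GPU plus host, networking, cooling

def gpu_count(total_watts: float, watts_per_gpu: float = WATTS_PER_H100_ALL_IN) -> int:
    return int(total_watts / watts_per_gpu)

deployed = gpu_count(40e6)       # ~40 MW of H100s
in_training = gpu_count(150e6)   # ~150 MW of H100s
next_gen_power = 1e9             # ~1 GW clusters in about 1.5 years
flop_per_joule_gain = 2.0        # assumed gain from B200-class hardware

relative_compute = (next_gen_power / 150e6) * flop_per_joule_gain
print(f"deployed: ~{deployed:,} H100s, training: ~{in_training:,} H100s")
print(f"next step: roughly {relative_compute:.0f}x current training compute")
```

Under these assumptions the next step of scaling is on the order of 10x in training compute, which is why base model scaling alone could plausibly swamp whatever advantage a clever post-training recipe currently confers.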
On the other hand, this is potentially a new dimension of scaling, if the RL can do real work and brute force scaling of base models wouldn't produce a chatbot good enough to reinvent the necessary RL on its own. There is a potentially very general pipeline to generative capabilities here. It starts with a preference about the outcome (the ability to verify an answer, to simultaneously judge the aesthetics and meaning of a poem, to see if a pull request really does resolve the issue). It then proceeds to training a process supervision model that estimates how much individual reasoning steps contribute to getting a good outcome, and to optimizing the generative model that proposes good reasoning steps (see the sketch below). With a few more years of base model scaling, LLMs are probably going to be good enough at evaluating impactful outcomes that they are incapable of producing directly, so this pipeline gets a lot of capabilities to manufacture. If o1's methodology is already applicable to this, that removes the uncertainty about whether it could have been made to start working without delay relative to the underlying scaling of base models. And the scaling of RL might go a long way before running out of exploitable preference signal about the outcomes of reasoning, independently of how the base models are being scaled up.
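Here is a minimal sketch of that pipeline, from an outcome preference down to step-level guidance of the generator. Every function is a hypothetical stub (this is not o1's actual method, which is not public); the point is only the direction of the signal: a cheap outcome check trains a process reward model, and the process reward model then steers which reasoning steps get generated and reinforced.

```python
# Sketch of the outcome-preference -> process-supervision -> generation pipeline.
# All model calls are hypothetical stubs; only the flow of signal is the point.

import random
from typing import List

def outcome_preference(problem: str, answer: str) -> float:
    """Cheap check on the final outcome (verifier, judge, test suite). Stub."""
    return random.random()

def score_step(process_reward_model, partial_trace: List[str], step: str) -> float:
    """Process supervision: estimate how much this step contributes to a good outcome.
    In a real pipeline this model would be trained against outcome_preference labels. Stub."""
    return random.random()

def propose_steps(generator, partial_trace: List[str], n: int = 4) -> List[str]:
    """The generative model proposes candidate next reasoning steps. Stub."""
    return [f"step-candidate-{i}" for i in range(n)]

def search_reasoning_trace(generator, prm, problem: str, depth: int = 5) -> List[str]:
    """Greedy search over reasoning steps guided by the process reward model."""
    trace: List[str] = [problem]
    for _ in range(depth):
        candidates = propose_steps(generator, trace)
        best = max(candidates, key=lambda s: score_step(prm, trace, s))
        trace.append(best)
    return trace

if __name__ == "__main__":
    trace = search_reasoning_trace(generator=None, prm=None,
                                   problem="does this pull request resolve the issue?")
    print(trace)
    # High-scoring traces can then be fed back as training data for the generator,
    # converting the ability to verify outcomes into the ability to produce them.
```

The reason this could be a genuinely new axis of scaling is the last step: verification is usually easier than generation, so as long as the model can still judge outcomes it cannot yet produce, the loop has more capability left to manufacture.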