GPT-4 level open weights models like Llama-3-405B don't seem capable of dangerous cognition. OpenAI o1 demonstrates that a GPT-4 level model can be post-trained into producing useful long-horizon reasoning traces. AlphaZero shows how capabilities can be obtained from compute alone, without additional data. If there is a way of bringing these together, the apparent helplessness of the current generation of open weights models might prove misleading.
Post-training is currently a combination of techniques that use synthetic data and human-labeled data. Human-labeled data significantly improves quality, but its collection is slow and scales poorly. Synthetic data is an increasingly useful part of post-training, and its generation can be automated, which scales easily. Unlike weaker models, GPT-4 level LLMs clearly pass reading comprehension on most occasions, and OpenAI o1 improves on this further. This suggests that at some point human data might become mostly unnecessary in post-training, even if it still helps slightly. Without it, post-training becomes automated and gets to use more compute, while avoiding the need for costly and complicated human labeling.
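To make "automated labeling" concrete, here is a minimal sketch of an LLM-as-judge loop in which a strong chatbot, rather than a human annotator, produces preference labels for post-training. This is an illustration under assumptions, not any lab's actual pipeline; `query_strong_model` and the other helpers are hypothetical placeholders.

```python
# Minimal sketch: replacing human preference labels with labels from a strong
# chatbot ("LLM-as-judge"). All helper names are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabeledPair:
    prompt: str
    chosen: str      # response the judge preferred
    rejected: str    # response the judge did not prefer


def query_strong_model(prompt: str) -> str:
    """Placeholder for a call to whatever strong chatbot is available.

    In a real pipeline this would hit a model API or a locally hosted model;
    it is stubbed out here because the point is the structure, not the model.
    """
    raise NotImplementedError


def judge_pair(prompt: str, response_a: str, response_b: str) -> LabeledPair:
    """Ask the strong model which of two candidate responses is better."""
    verdict = query_strong_model(
        "You are grading two answers to the same question.\n"
        f"Question: {prompt}\n\nAnswer A: {response_a}\n\nAnswer B: {response_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better answer."
    ).strip()
    if verdict == "A":
        return LabeledPair(prompt, chosen=response_a, rejected=response_b)
    return LabeledPair(prompt, chosen=response_b, rejected=response_a)


def build_preference_dataset(
    prompts: list[str], sample: Callable[[str], str]
) -> list[LabeledPair]:
    """Generate two candidate responses per prompt and label them automatically.

    `sample` draws a response from the model being post-trained. The resulting
    pairs can feed a standard preference-tuning step (a reward model, DPO, etc.)
    with no human annotators in the loop, so the bottleneck becomes compute.
    """
    return [judge_pair(p, sample(p), sample(p)) for p in prompts]
```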
A pretrained model at the next level of scale, such as Llama 4, if made available in open weights, might initially look approximately as tame as current models. OpenAI o1 demonstrates that useful post-training for long sequences of System 2 reasoning is possible. In the case of o1 in particular, this might involve a lot of human labeling, making its reproduction a very complicated process (at least if the relevant datasets are not released and the reasoning traces themselves are not leaked in large quantities). But if some generally available chatbots at the next level of scale are good enough at automating labeling, this complication could be sidestepped, with o1-style post-training cheaply reproduced on top of a previously released open weights model.
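As a rough illustration of what such a cheap reproduction might look like, here is a sketch of a rejection-sampling style pipeline: an available strong chatbot generates long reasoning traces, an automated check keeps only traces whose final answer verifies, and the surviving traces become supervised fine-tuning data for the open weights base model. The o1 recipe itself is not public, and every function name here is a hypothetical placeholder.

```python
# Sketch of the "cheap reproduction" route: generate long reasoning traces with
# an available strong chatbot, keep only traces whose final answer checks out,
# and fine-tune an open weights base model on the survivors. This is a generic
# rejection-sampling style recipe, not the (unpublished) o1 method.

def generate_trace(problem: str) -> tuple[str, str]:
    """Placeholder: ask the strong chatbot for step-by-step reasoning plus a
    final answer. Returns (reasoning_trace, final_answer)."""
    raise NotImplementedError


def answer_is_correct(problem: str, answer: str) -> bool:
    """Placeholder: automatic check (exact match, unit tests, a verifier model).
    Automating this check is what removes humans from the loop."""
    raise NotImplementedError


def build_reasoning_dataset(problems: list[str], samples_per_problem: int = 4):
    """Collect verified (problem, trace) pairs suitable for supervised fine-tuning."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace, answer = generate_trace(problem)
            if answer_is_correct(problem, answer):
                dataset.append({"prompt": problem, "completion": trace})
                break  # one verified trace per problem is enough for this sketch
    return dataset


# The resulting dataset would then go through ordinary supervised fine-tuning of
# the open weights base model with any standard SFT trainer, which is far
# cheaper than collecting equivalent reasoning traces from human experts.
```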
So there is an overhang in an open weights model that's distributed without long-horizon reasoning post-training: applying such post-training significantly improves its capabilities, so assessments made before that post-training understate what the model can do. The problem right now is that a new level of pretraining scale is approaching in the coming months, while the ability to cheaply apply long-horizon reasoning post-training might follow shortly thereafter, possibly unlocked by these very same models at the new level of pretraining scale (since it might currently be too expensive for most actors to implement, or to run enough experiments to figure out how). The resulting level of capabilities is currently unknown, and could well remain unknown outside the leading labs until after the enabling artifacts, the open weights pretrained models at the next level of scale, have already been published.
The point isn't particularly that this is low-hanging fruit or that it's going to happen with other LLMs soon. I expect that, counterfactually, System 2 reasoning would likely arrive soon even without o1, made easy and thereby inevitable merely by further scaling of LLMs, so the somewhat surprising fact that it already works doesn't significantly move my timelines.
The issue I'm pointing out is timing: a possible gap between the point when base models at the next level of scale that don't yet have o1-like System 2 reasoning get published in open weights (Llama 4 seems the most likely specific model of this kind to come out next year), and the somewhat later point when it becomes feasible to apply post-training for o1-like System 2 reasoning to these base models.
In the interim, decisions to publish open weights would be governed by the capabilities observed without System 2 reasoning, and so they wouldn't be informed decisions. It would be very easy to justify publication even in the face of third party evaluations, since those evaluations won't themselves apply o1-like post-training to a model that doesn't already have it in order to evaluate the resulting capabilities. But a few months later, there is enough know-how in the open to do exactly that, and capabilities cross all the thresholds that should have triggered in those evaluations, but didn't, since o1-like post-training wasn't yet commoditized at the time they were run.