GPT-4 level open weights models like Llama-3-405B don't seem capable of dangerous cognition. OpenAI o1 demonstrates that a GPT-4 level model can be post-trained into producing useful long horizon reasoning traces. AlphaZero shows how capabilities can be obtained from compute alone, with no additional data. If there is a way of bringing these together, the apparent helplessness of the current generation of open weights models might prove misleading.
Post-training is currently a combination of techniques that use synthetic data and human-labeled data. Human-labeled data significantly improves quality, but its collection is slow and scales poorly. Synthetic data is an increasingly useful part of post-training, and the automated aspects of its generation scale easily. Unlike weaker models, GPT-4 level LLMs clearly pass reading comprehension on most occasions, and OpenAI o1 improves on this further. This suggests that at some point human data might become mostly unnecessary in post-training, even if it still helps slightly. Without it, post-training becomes automated and gets to use more compute, while avoiding the need for costly and complicated human labeling.
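To make "automating labeling" concrete, here is a minimal sketch of RLAIF-style preference labeling, where a capable model stands in for the human rater. The `query_model` function and the judge prompt are hypothetical placeholders of my choosing, not any lab's published pipeline.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a chatbot API call; swap in a real client."""
    return random.choice(["A", "B"])

# Illustrative judge prompt; real systems use more careful templates.
JUDGE_TEMPLATE = (
    "Compare the two responses to the prompt below and answer with "
    "exactly one letter, 'A' or 'B', naming the better response.\n\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}"
)

def auto_label(prompt: str, response_a: str, response_b: str) -> str:
    """Use a strong model as the annotator in place of a human rater."""
    verdict = query_model(
        JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b)
    )
    return verdict.strip()[:1]

# Each call replaces one human comparison, so labeling throughput is
# bounded by compute and API cost rather than by recruiting raters.
example = ("What is 2+2?", "4", "22")
print(example, "->", auto_label(*example))
```

The point of the sketch is the scaling property: once the judge model is good enough, the labeling loop contains no human in it, so its throughput grows with available compute.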
A pretrained model at the next level of scale, such as Llama 4, if made available in open weights, might initially look approximately as tame as current models. OpenAI o1 demonstrates that useful post-training for long sequences of System 2 reasoning is possible. In the case of o1 in particular, this might involve a lot of human labeling, making its reproduction a very complicated process (at least if the relevant datasets are not released and the reasoning traces themselves are not leaked in large quantities). But if some generally available chatbots at the next level of scale are good enough at automating labeling, this complication could be sidestepped, with o1-style post-training cheaply reproduced on top of a previously released open weights model.
So there is an overhang in an open weights model that's distributed without long horizon reasoning post-training, since applying such post-training significantly improves its capabilities, making any perception of its capabilities formed before that point inadequate. The problem right now is that a new level of pretraining scale is arriving in the coming months, while the ability to cheaply apply long horizon reasoning post-training might follow shortly thereafter, possibly unlocked by these very same models at the new level of pretraining scale (since it might currently be too expensive for most actors to implement, or to run enough experiments to figure out how). The resulting level of capabilities is currently unknown, and could well remain unknown outside the leading labs until after the enabling artifacts (the open weights pretrained models at the next level of scale) have already been published.
The best-known worry about training on synthetic data is model collapse, where a model degrades after repeatedly training on its own outputs; they even managed to publish it in Nature. But if you don't throw out the original data and instead train on both the original data and the generated data, this doesn't seem to happen (see also). Besides, there is the empirical observation that o1 in fact works at GPT-4 scale, so similar methodology might survive more scaling, at least at the upcoming ~5e26 FLOPs level of next year. That level is the focus of this post: the hypothetical where an open weights release arrives before there is an open source reproduction of o1's methodology, which subsequently makes that model much stronger in a way that wasn't accounted for when deciding to release it.
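The accumulate-versus-replace distinction is easy to see in a toy setting. The sketch below is a minimal illustration under simplified assumptions (a Gaussian standing in for the model), reproducing only the qualitative effect, not any paper's experiment:

```python
# Toy model collapse: "train" by fitting a mean and standard deviation,
# "generate" by sampling from the fit, and iterate. Replacing the data
# each generation lets estimation noise compound, so the fitted variance
# tends to drift toward zero; keeping the original data in the mix
# anchors the fit and the drift stays bounded.

import numpy as np

rng = np.random.default_rng(0)
ORIGINAL = rng.normal(loc=0.0, scale=1.0, size=100)

def run(generations: int, accumulate: bool) -> float:
    data = ORIGINAL
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()           # fit the "model"
        synthetic = rng.normal(mu, sigma, size=100)   # generate from it
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
    return float(data.std())

print("replace only:      final std =", run(300, accumulate=False))
print("keep original too: final std =", run(300, accumulate=True))
```

With replacement, each generation refits to a noisy sample of the previous fit, a multiplicative random walk with downward drift in the variance; with accumulation, each new synthetic batch is a shrinking fraction of the training set, so the original data keeps pulling the fit back.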
AlphaZero is purely synthetic data, and humans (note congenitally blind humans, so video data isn't crucial) use maybe 10,000 times less natural data than Llama-3-405B (trained on 15 trillion tokens) to reach better performance, though each of us individually knows far fewer facts. So clearly there is some way to get very far with merely 50 trillion natural tokens (humans manage with on the order of a billion), though this is not relevant to o1 specifically.
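As a sanity check on that factor (both the 15 trillion tokens and the 10,000x ratio are the rough figures above, not measurements):

$$
\frac{15 \times 10^{12}\ \text{tokens}}{10^{4}} = 1.5 \times 10^{9}\ \text{tokens},
\qquad
\frac{1.5 \times 10^{9}\ \text{tokens}}{30\ \text{yr} \times 3.15 \times 10^{7}\ \text{s/yr}} \approx 1.6\ \text{tokens/s},
$$

which is on the order of conversational speech or reading rates, so the estimate of a human's lifetime language intake is at least self-consistent.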
Another point is that you can repeat the data for LLMs (5-15 times with good results, up to 60 times with slight further improvement; then there is double descent, with the worst performance at around 200 repetitions, so improvement might resume after hundreds of repetitions). This suggests that it might be possible to repeat natural data many times to balance out a lot more unique synthetic data.
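For a rough sense of the balancing act, a token budget can be split between repeated natural data and unique synthetic data. All numbers in this sketch are illustrative assumptions, not figures from the cited results:

```python
# Back-of-the-envelope mixture planner: how much unique synthetic data
# fits in a training budget once natural data is repeated k times.

def mixture(total_budget: float, natural_tokens: float, repeats: int) -> dict:
    """Split a token budget between repeated natural and unique synthetic data."""
    natural_total = natural_tokens * repeats
    if natural_total > total_budget:
        raise ValueError("repeated natural data alone exceeds the budget")
    synthetic = total_budget - natural_total
    return {
        "natural (repeated)": natural_total,
        "unique synthetic": synthetic,
        "synthetic fraction": synthetic / total_budget,
    }

# Example: 15T natural tokens repeated 10 times inside a hypothetical
# 500T-token run leaves 350T tokens of room for unique synthetic data.
print(mixture(total_budget=5e14, natural_tokens=1.5e13, repeats=10))
```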