Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance
I think perplexity is a better measure of general intelligence than any legible benchmark. There are rumors that in some settings R1-like methods only started showing signs of life for GPT-4 level models, while exactly the same methods didn't work for weaker models[1]. Something else might first start working with the kind of perplexity that a competent lab can concoct in a 5e27 FLOPs model, even if it can later be adapted for weaker models.
But GPT-4 didn't just have better perplexity than previous models, it also had substantially better downstream performance. To me it seems more likely that the better downstream performance is responsible for the model being well-suited for reasoning RL, since this is what we would intuitively describe as its degree of "intelligence", and intelligence seems important when teaching a model how to reason, while it's not clear what perplexity itself would be useful for. (One could probably test this by training a GPT-4 scale model with similar perplexity but on bad training data, such that it only reaches the downstream performance of older models. Then I predict that it would be as bad as those older models when doing reasoning RL. But of course this is a test far too expensive to carry out.)
you don't get performance that is significantly smarter than the humans who wrote the text in the pretraining data
Prediction of details can make use of arbitrarily high levels of capability, vastly exceeding those of the authors of the predicted text. What the token prediction objective gives you is generality and grounding in the world, even if it seems inefficient compared to imagined currently-unavailable alternatives.
You may train a model on text typed by little children, such that the model is able to competently imitate a child typing, but the resulting model's performance wouldn't significantly exceed that of a child, even though the model uses a lot of compute. Training on text doesn't really give a lot of direct grounding in the world, because text represents real-world data that has been compressed and filtered by human brains, and their intelligence acts as a fundamental bottleneck. Imagine you are a natural scientist, but instead of making direct observations in the world, you are locked in a room and limited to listening to what a little kid, who saw the natural world, happens to say about it. After listening for a while, at some point you wouldn't learn much more about the world from the child.
There were multiple reports claiming that scaling base LLM pretraining yielded unexpected diminishing returns for several new frontier models in 2024, like OpenAI's Orion, which was apparently planned to be GPT-5. They mention a lack of high-quality training data; if that is the cause, it would not be surprising, since the Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance. Base language models perform a form of imitation learning, and it seems that you don't get performance that is significantly smarter than the humans who wrote the text in the pretraining data, even if perplexity keeps improving.
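(For reference, the Chinchilla law is a fit to pretraining loss only. Below is a minimal sketch of its parametric form, using the approximate constants reported in Hoffmann et al. (2022); the concrete numbers and example model sizes are just illustrative.)

```python
# Sketch of the Chinchilla parametric loss L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the approximate published fits; the law predicts loss/perplexity,
# not downstream benchmark scores.

E, A, B = 1.69, 406.4, 410.7   # irreducible loss and fitted coefficients
alpha, beta = 0.34, 0.28       # fitted exponents for parameters and tokens

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Two ways to spend roughly the same ~6*N*D FLOPs:
print(chinchilla_loss(70e9, 1.4e12))    # compute-optimal allocation (Chinchilla itself)
print(chinchilla_loss(280e9, 0.35e12))  # larger but undertrained model: higher loss
```

The law only says which allocation reaches lower loss at a given compute budget; it is silent on how that loss translates into benchmark performance.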
Since pretraining compute has in the past been a major bottleneck for frontier LLM performance, a reduced effect of pretraining means that algorithmic progress within a lab is now more important than it was two years ago. That would mean the relative importance of having a lot of compute has gone down, while the relative importance of having highly capable AI researchers (who can improve model performance through better AI architectures or training procedures) has gone up. The ability of a lab's AI researchers seems to depend much less on available money than its compute resources do. This would explain why e.g. Microsoft or Apple don't have highly competitive models despite large financial resources, and why xAI's Grok 3 isn't very far beyond DeepSeek's R1, despite a vastly greater compute budget.
Now it seems possible that this will change in the future, e.g. when performance starts to depend strongly on inference compute (i.e. not just logarithmically), or when pretraining switches from primarily text to primarily sensory data (like video), which wouldn't be bottlenecked by imitation learning on human-written text. Another possibility is that pretraining on synthetic LLM outputs, like CoTs, could provide the necessary superhuman text for the pretraining data. But none of this is currently the case, as far as I can tell.
If the US introduces UBI (likely funded mainly through taxation of AI companies), it will only be distributed to US Americans. That would indeed mean that people who are not citizens of the country that wins the AI race, likely the US, will become a lot poorer than US citizens, because most of the capital gets distributed to the winning AI company or companies, and consequently to the US.
I think abstract concepts could be distinguished with higher-order logic (= simple type theory). For example, the logical predicate "is spherical" (the property of being a sphere) applies to concrete objects. But the predicate "is a shape" applies to properties, like the property of being a sphere. And properties/concepts are abstract objects. So the shape concept is of a higher logical type than the sphere concept. Or take the "color" concept, the property of being a color. Its extension contains not concrete objects, but other properties, like being red. Again, concrete objects can be red, but only properties (like redness), which are abstract objects, can be colors. A tomato is not a color, nor can any other concrete (physical or mental) object be a color. There is a type mismatch.
Formally: Let the type of concrete objects be e (for "entity"), and the type of the two truth values (TRUE and FALSE) be t (for "truth value"), and let functional types, which take an object of type α and return an object of type β, be designated with α→β. Then the type of "is a sphere" is e→t, and the type of "is a shape" is (e→t)→t. Only objects of type e are concrete, so objects of type e→t (properties) are abstract. Even if there weren't any physical spheres, no spherical things like planets or soccer balls, you could still talk about the abstract sphere: the sphere concept, the property of being spherical.
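As a toy illustration of this type hierarchy (my own sketch; the Python type hints and example objects stand in for the logical types and are not meant as a serious formalization):

```python
from typing import Callable

Entity = str                          # type e: concrete objects (toy stand-in)
Prop = Callable[[Entity], bool]       # type e -> t: properties of concrete objects
PropOfProps = Callable[[Prop], bool]  # type (e -> t) -> t: properties of properties

def is_sphere(x: Entity) -> bool:
    """First-order property (type e -> t): applies to concrete objects."""
    return x in {"soccer ball", "planet"}

def is_red(x: Entity) -> bool:
    """Another first-order property."""
    return x in {"tomato", "fire truck"}

def is_shape(p: Prop) -> bool:
    """Second-order property (type (e -> t) -> t): applies to properties."""
    return p is is_sphere   # toy extension: only sphericality counts as a shape here

def is_color(p: Prop) -> bool:
    """Also second-order: redness is a color, a tomato is not."""
    return p is is_red

print(is_shape(is_sphere))  # True: the sphere concept is a shape
print(is_color(is_red))     # True: redness is a color
# is_color("tomato") is rejected by a static type checker: a concrete object has
# type e, not e -> t, which is exactly the type mismatch described above.
```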
Now the question is whether all the (intuitively) abstract objects can indeed, in principle, be formalized as being of some complex logical type. I think yes. Because: What else could they be? (I know a way of analyzing natural numbers, the prototypical examples of abstract objects, as complex logical types. Namely as numerical quantifiers. Though the analysis in that case is significantly more involved than in the "color" and "shape" examples.)
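A crude sketch of that last point, restricted to a finite domain (the real analysis is more involved, as said; the domain and names here are made up): the number two becomes the quantifier "exactly two things fall under this property", again an object of type (e→t)→t.

```python
from typing import Callable

Entity = str
Prop = Callable[[Entity], bool]

DOMAIN: list[Entity] = ["sun", "moon", "tomato"]  # toy finite domain of concrete objects

def exactly_two(p: Prop) -> bool:
    """The number two read as a numerical quantifier of type (e -> t) -> t:
    it takes a property and says whether exactly two things fall under it."""
    return sum(1 for x in DOMAIN if p(x)) == 2

print(exactly_two(lambda x: x in {"sun", "moon"}))  # True
print(exactly_two(lambda x: x == "tomato"))         # False
```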
Reinforcement learning is very sample-inefficient compared to supervised learning, so it mostly only works if you have some automatic way of generating both training tasks and rewards, one that scales to millions of samples.
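As a minimal sketch of what such automatic task and reward generation can look like in the simplest case (arithmetic with a programmatic verifier; all names here are illustrative, not any lab's actual pipeline):

```python
import random

def generate_task(rng: random.Random) -> tuple[str, int]:
    """Automatically generate a training task together with its ground-truth answer."""
    a, b = rng.randint(10, 999), rng.randint(10, 999)
    return f"What is {a} * {b}?", a * b

def reward(model_answer: str, ground_truth: int) -> float:
    """Verifiable reward: 1.0 if the model's final answer is correct, else 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == ground_truth else 0.0
    except ValueError:
        return 0.0

# Both pieces are cheap and fully automatic, so they scale to millions of samples.
rng = random.Random(0)
question, answer = generate_task(rng)
print(question, reward(str(answer), answer))  # prints the question and a reward of 1.0
```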