"the data bottleneck that threatens to strangle scaling"
There is no data bottleneck (at least for data that doesn't need to be high quality), because data can be repeated in training: about 4 times with little difference compared to unique data, and up to about 16 times while still significantly improving the model. This was notably used in Galactica (see Figure 6), published in Nov 2022; a systematic study of scaling laws for repeated data followed in May 2023; and repeated data was recently applied in StarCoder 2 (Feb 2024).
A Chinchilla-optimal model uses a model size proportional to dataset size, meaning compute is proportional to data squared, so repeating data 16 times means finding a use for 256 times more compute. The filtered and deduplicated CommonCrawl text dataset RedPajama-Data-v2 has 30 trillion tokens. Repeated 16 times with a Chinchilla-optimal monolithic Transformer, that would take about 7e28 FLOPs of compute, and the requirement keeps scaling with data squared if more data can be found, which it certainly can, even if not OOMs more. Assuming BF16 training at 30% utilization, this comes to about 3.2e10 H100-hours, which at $2/hour is roughly $65 billion. Anchoring instead to the rumored 2e25-FLOPs GPT-4 run at $100 million gives about $350 billion. Both numbers are likely outside current commercial feasibility if smaller models fail to demonstrate sufficiently impressive feats, and there is still that further quadratic scaling of needed compute for data beyond 30 trillion tokens. (Though Microscaling in Blackwell might reduce the cost of effective compute more than otherwise could be expected this soon.)
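For concreteness, here is that arithmetic as a short Python sketch. Every input (the token count, the roughly 20-tokens-per-parameter Chinchilla ratio, the 6ND FLOPs approximation, the per-chip throughput, the utilization, and the prices) is an assumption taken from, or implied by, the figures above rather than a measured value.

```python
# Back-of-the-envelope reconstruction of the compute and cost figures above.
# All inputs are assumptions stated or implied in the text, not measured values.

tokens_unique = 30e12              # RedPajama-Data-v2, ~30 trillion tokens
repeats = 16                       # repeat the data 16 times
tokens_seen = tokens_unique * repeats

params = tokens_seen / 20          # Chinchilla-optimal: roughly 20 tokens per parameter
flops = 6 * params * tokens_seen   # standard 6*N*D training-FLOPs approximation
print(f"training compute ~ {flops:.1e} FLOPs")                      # ~6.9e28

h100_flops_per_s = 2e15            # per-H100 throughput implied by the 3.2e10 figure
utilization = 0.30                 # assumed utilization from the text
h100_hours = flops / (h100_flops_per_s * utilization * 3600)
print(f"~{h100_hours:.1e} H100-hours")                               # ~3.2e10

print(f"~${h100_hours * 2 / 1e9:.0f}B at $2/hour")                   # ~$65B
print(f"~${flops / 2e25 * 100e6 / 1e9:.0f}B anchored to GPT-4")       # ~$350B
```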
This post is adapted from a longer, more wide-ranging post on Substack where I attempt to collect some of my thoughts about AI as a relative outsider to the field. The section I have decided to share here, though, I believe to be novel.
Success in self-play by game AIs like AlphaZero has led to some interest in the possibility that it could loosen (or even do away with) the data bottleneck that threatens to strangle scaling as a path forward for LLMs. By making analogies to the human phenomena of dialects and group polarization, I provide a starting point for further, more analytically framed arguments against self-play as a viable route to increasing the intelligence or linguistic capacity of LLMs.
The Viability of the Analogy
My argument rests, crucially, on an analogy between an AI using self-play to create its own training data and a human (or group of humans) interacting with each other or with the world to learn and adapt. I believe this is a defensible analogy.
Self-play is quite easy to translate into human cognitive experience. What defines self-play is that the same neural network which creates the new data is also the one which learns by it. The fact that this is possible might be philosophically interesting, but the actual experience is pedestrian.
When a chess player computes his next moves — “I do that, then she does that, I take there, she takes back, I move there, check, check, take, take… no, not good” — the chess player is doing a sort of self-play. An even more direct analogy is the fact that chess players often do play against themselves, and can learn from that experience. That this is fairly directly analogous to AI self-play seems to me obvious.
What may be less obvious, but also seems true to me, is that self-play is likewise functionally equivalent to a siloed group of interchangeable human individuals acting amongst one another. The important part is that they are siloed: they have only the background conditions of cognition and the game itself to go on. Given enough time and communication, knowledge becomes communal, and we can consider the group as a single intelligence. Or, if that doesn't seem right, we can pluck one person from the group and expect them to behave much like anyone else in it.
What is important for this analogy is that it is not surprising that AlphaZero was able to learn better chess than anyone else merely from self-play. Give a group of people chess for a long enough period of time and, assuming they have high enough intelligence, they should converge on the most effective strategies. There may well be some path dependence, but eventually they should land on a competitive equilibrium roughly equivalent to our own. If this seems implausible, we can postulate a simpler game (for our relatively simpler brains, vis-à-vis AI): tic-tac-toe. It would be deeply strange to leave a group of human adults alone for a year with nothing to do except learn and excel at tic-tac-toe, come back, and either beat them in a game, or have them beat you, or even, really, find any differences in strategy between them and you.
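To make the tic-tac-toe intuition concrete, here is a minimal minimax sketch (my own illustration, assuming only the standard rules of the game) showing that the game has a single determinate value under optimal play: however a siloed group gets there, best play converges on a draw.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board, indexed 0..8.
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Value for X under optimal play by both sides: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if "." not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    moves = [value(board[:i] + player + board[i+1:], nxt)
             for i, sq in enumerate(board) if sq == "."]
    return max(moves) if player == "X" else min(moves)

# Optimal play from the empty board is a draw: the equilibrium any sufficiently
# capable, siloed group of players should eventually settle on.
print(value("." * 9, "X"))  # prints 0
```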
Games are competitive systems with evident rules, direct competition, and clear success conditions. In these circumstances, competition creates conformity, and less-well-adapted strategies are competed away (or someone just keeps losing, I don't know). These conditions are probably something close to the line between Inadequate Equilibria and The Secret of Our Success. The limits of competitive conformity are likely far more complicated than these three qualities, but they are a good start.
Self-Play as Dialect
I see two related but distinct avenues by which self-play could be supposed to help LLMs. The first is LLMs' facility with language and the second is LLMs' understanding of the world. These are the flip side of what I see as two frustrations with LLMs: (1) times when they seem not to respond appropriately because of a disconnect in understanding what you are actually saying, and (2) difficulties incorporating new knowledge and ensuring that their beliefs cohere with one another and reflect the world. I will tackle them separately, because I believe the 'learning processes' for these two aspects of LLM capacity are analogically separate.
Chess works as an avenue of self-play because the rules are evident and the success condition is clear. If we want to extend self-play to LLMs' use and understanding of language, though, I am much less optimistic. Language is a place where neither of those conditions exists. The rules of language are brutally inevident, and a “success condition” for language is an almost meaningless notion. In fact, when a community of humans is left alone with a language for a long time, they tend to do the opposite of converging on the larger community: they create a dialect.
All humans have the 'bones' of language somewhere in their brains, but not all humans speak the same language. Why? Because there is no competitive pressure pushing all humans to speak Latin other than that imposed on them by Roman centurions. But we can make it harder on ourselves: not only do humans with the same bones of language create different languages, but even humans with the same 'meat' of language create different languages! Barring discussion with the wider linguistic community, a siloed group of humans who speak the same language as everyone else will, almost certainly, create a new dialect if left alone for a long enough period of time.
Similarly, I would expect an LLM’s self-talk to bring it further from our understanding, rather than closer. The minor differences incipient in the LLM’s language-patterns will be emphasized and reinforced by having difference-inflected data drown out the original data in training. Language is a chaotic system and a slightly different initial condition will likely have the LLM moving in a very different direction, linguistically, from us.
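A toy numerical illustration of that drift dynamic, assuming nothing about real LLM training beyond the bare mechanism of fitting to your own output: repeatedly re-fit a distribution to samples drawn from the previous fit, and small sampling quirks get baked in while the estimate wanders away from the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Original data": the wider linguistic community, a standard normal here.
data = rng.normal(loc=0.0, scale=1.0, size=1000)
mu, sigma = data.mean(), data.std()

# Each generation is fit only to samples from the previous generation
# (self-talk drowning out the original data).
for generation in range(20):
    samples = rng.normal(mu, sigma, size=50)
    mu, sigma = samples.mean(), samples.std()
    print(f"gen {generation:2d}: mean={mu:+.2f}, std={sigma:.2f}")

# The mean typically drifts away from 0 and the spread tends to shrink over
# generations: a private "dialect" of the original distribution, not a
# refinement of it.
```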
Now, perhaps one can imagine solutions to the AI-dialect problem. For instance, one might build another AI to translate between our own language and the LLM-dialect. However, this merely pushes the bottleneck one step down the line: the translator-LLM must be trained on some data, but can only get so much from the internet. So its ability to translate (i.e. its linguistic ability) will be constrained by the very constraint we were trying to resolve.
One possible counterargument is the analogy to babies: it seems very important to language development in infants that they babble. But it seems to me that babbling is more about learning how to control your mouth than it is about how to convey thoughts. All practice for the latter seems, to me, to happen largely by bumping up against others.
The other solution I could think of is that people could learn to speak LLM-dialect, but I'm not sure this is viable. It could be, but for it to be worth it, there would have to be some concomitant benefit to self-talk.
So, perhaps, someone might say something like “it’s okay if AIs create their own dialects as long as we have translation devices. What self-play can still do is make AIs more effective at answering queries and using language qua language, rather than language qua English.”
Self-Play as Group Polarization
Our contemporary LLMs are incredibly powerful and intelligent, and produce feats of inference and application that make me incredibly wary about doubting their future capacity based on present hindrances. However.
This argument is that, by talking to itself, the AI can achieve accuracy, if not communicability: there is something inherent in the conceptual map of an AI that will bring it closer to truth if it is left to talk to itself.
Again, we should be able to make an analogy to a group of humans talking amongst themselves, without any contact with the world they are talking about. What happens in these cases? Group polarization:
The whole article is an excellent tour through a fascinating pocket of social psychology which could be of particular interest to rationalists. However, for the purposes of this argument, what matters is that there are two main mechanisms which drive group polarization, one of which does not bear on AI and a second which certainly does: social pressures to conform and limited argument pools. An AI will not experience social pressure to conform. It will certainly have a limited argument pool.
Limited argument pools reflect the fact that people are much better at arguing for their position than against it, and when we hear a bunch of arguments all pointing in one direction, we tend to move in that direction (for the very reasonable reason that the arguments may be convincing!). AIs have a similar problem — they also ‘believe’ the patterns they have already modeled, and self-talk will only reinforce and widen any differences between their patterns and the correct ones.
Much like their human counterparts, LLMs would update their priors in more and more extreme directions based on their initial argument pools. Without a success or failure condition to bump up against, any incipient misconceptions about the world will spiral out to larger and larger magnitude. Now, of course, the successes will also spiral out to larger and larger magnitude, but this will not mean that the AI is overall more accurate in its patterns. Rather, it would likely just be more consistent, with more of the implications of its ‘beliefs’ cohering with one another. This is a prediction I am very uncertain about, as I am not sure how to model it.
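A deliberately crude sketch of the limited-argument-pool mechanism, with every number invented purely for illustration (it is not a claim about how LLM training actually behaves): an agent that rehearses arguments drawn only from its own small initial pool, and counts each rehearsal as fresh evidence, becomes steadily more confident without ever getting more accurate.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small argument pool that happens to lean one way: +1 = "for", -1 = "against".
pool = np.array([+1, +1, +1, +1, -1])

log_odds = 0.0                            # start undecided
for round_ in range(10):
    heard = rng.choice(pool, size=3, replace=True)
    # Each rehearsed argument is (wrongly) treated as fresh, independent evidence.
    log_odds += 0.5 * heard.sum()
    p = 1 / (1 + np.exp(-log_odds))
    print(f"round {round_}: P(position) = {p:.3f}")

# Confidence typically ratchets toward certainty even though no argument the
# agent did not already generate itself ever enters the pool.
```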
When we have an AI self-play its reasoning, it has only the arguments that it begins with.
Unless we believe that everything we need to know about the world is contained, in some form or another, in the English-speaking internet of May 2023 and that all we need are the (semi-)logical implications of those words — or at the very least that the English-speaking internet of May 2023 is not systematically biased towards any wrong answers — we should be wary of expecting LLMs to learn about the world through self-play.
So what are we to do about the data?
LLMs as iPad Babies
What seems striking to me about all these conversations is the lack of discussion around how lossy language is as an informational compression of the world. Arguably, this is an underlying concept of my post on language and incentives and a core piece of the idea that tails come apart — i.e. that at the extremes, concepts do not provide clear answers to the questions we might expect them to.
All language is, to an extent, simply compression. Attempting to describe an image, pixel by pixel, is a silly endeavor. Just say it’s an impressionistic painting of a deer in a wheat field and get on with your life. Ditto with the world at large. There is so much more data in a three-second slice of living than there is in this post. When we say that the bitter lesson is that leveraging more compute and data does better than theory-laden solutions, well, what if the data for LLMs is already heavily theory-laden? It is not the world, but already a compression of it.
Even more so than is the case for humans, a mistake in an LLM’s conception of language is a mistake in its conception of the world — since all it has is language.
As we can handle more data and need less theory, the natural next step, it seems to me, is to move AI from linguistic to semiotic: AI must experience the world, bump up against it in all its granularity. Then, bridging the linguistic and semiotic is what gives AI (1) more data, (2) access to success conditions, and (3) the ability to self-play productively. Once an AI’s thought patterns can bump up against the world and not just itself, it seems to me that it really will be able to ‘self-play’ (though not, in some sense, fully self-play, since its ‘partner’ in play will be the world — much like how it is for us always).
Some people, though, may be worried about bringing AI autonomously into the world like that. That is reasonable, and why this post is somewhat concerning for me.