Don't you also need to include the millions of brains over millions of years of pre-training during evolution?
I don’t think that’s a good way to think about things. I think evolution is much more closely analogous to a search over neural architectures, hyperparameters, reward / loss functions, and other things like that, and not like “pretraining” for within-lifetime learning. See my post “Learning From Scratch” In The Brain.
I've only skimmed your post, but is part of the claim that the things that brains instinctually know are too minor to count?
I think the brain has parts that are non-pretrained learning algorithms (cortex, striatum, cerebellum), and I think that the brain has other parts that are not learning algorithms at all (hypothalamus, brainstem).
The hypothalamus & brainstem do lots of things, but I don’t think I would describe them as “knowing” anything at all.
Like, I think there’s a little innate part of the brainstem that triggers vomiting—when it gets certain combinations of input signals then it triggers certain muscle movements and hormones etc. Does that mean that the brainstem “knows when and how to vomit”? I vote for “not really”. I would rather say “the brainstem has an innate vomiting reflex”, or maybe “the brainstem contains a little simple machine that will set off vomiting under thus-and-such circumstance” or something like that.
I think the brainstem “knows when and how to vomit” in the same sense that a mechanical thermostat “knows when to turn on the heater”, and the same sense as a ribosome “knows how to build proteins”, which is to say, that would be a pretty weird use of the word “know”. I think I’d rather reserve the term “knowing” for the information stored in the cortex and related structures. For example, if we use the everyday sense of “know”, then it’s entirely possible for someone to not “know” that vomiting exists at all (i.e. if they’ve never done it or seen it or heard of it), even though that little reflex circuit is present and functional in their own brainstem.
I think that gyri are mostly hard-coded by evolution, and given how strongly they restrict the computation space that the cortical area can learn, one could consider the cortex to be heavily pretrained by evolution.
Studying how gyral geometry correlates with psychiatric conditions is an ongoing hot topic.
Neural network architecture is very different from neural network pretraining. Why do you think gyri are related to the latter not the former? (I think they're related to the former.)
If all humans have about as many neurons in the gyrus that is hardwired to receive from the eyes, it seems safe to assume that the vast majority of humans will end up with this gyrus extracting the same features.
Hence my view is that evolution, by imposing a few hardwired connections and gyri geometries, gives an enormous bias in the space of possible networks, which is similar to what pretraining is.
In essence, evolution gives us a foundation model that we fine-tune with our own experiences.
What do you think? Does that make sense?
No, it doesn’t make sense…
by imposing a few hardwired connections and gyri geometries, gives an enormous bias in the space of possible networks, which is similar to what pretraining is.
A 12-layer ConvNet versus a 12-layer fully-connected MLP, given the same data, will wind up with very different trained models that do different things. In that sense, switching from MLP to ConvNet “gives an enormous bias in the space of possible networks”.
But “using a ConvNet” is NOT pretraining, right? You can pretrain a ConvNet (just like you can pretrain anything), but the ConvNet architecture itself is not an example of pretraining.
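One rough way to see the distinction, as a minimal sketch: the parameter counts below compare a single fully-connected layer with a single 3x3 convolutional layer on the same input. The sizes (a 32x32 RGB input, 16 output channels) are illustrative assumptions, not anything from the discussion above. No weights are trained or even initialized here, so the difference comes entirely from the architecture.

```python
# Illustrative sizes only (assumed): a 32x32 RGB image mapped to 16 feature maps.
HEIGHT, WIDTH, IN_CHANNELS, OUT_CHANNELS, KERNEL = 32, 32, 3, 16, 3


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a k x k convolution (weights shared across positions)."""
    return k * k * c_in * c_out + c_out


n_in = HEIGHT * WIDTH * IN_CHANNELS
# Same output shape as the conv layer would produce with "same" padding.
n_out = HEIGHT * WIDTH * OUT_CHANNELS

mlp_layer = dense_params(n_in, n_out)
conv_layer = conv_params(IN_CHANNELS, OUT_CHANNELS, KERNEL)

print(f"fully-connected: {mlp_layer:,} parameters")
print(f"convolutional:   {conv_layer:,} parameters")
```

The convolutional layer is restricted to a far smaller family of functions before it sees any data. That is a bias coming from the architecture, not from pretrained weights.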
If all humans have about as many neurons in the gyrus that is hardwired to receive from the eyes, it seems safe to assume that the vast majority of humans will end up with this gyrus extracting the same features.
I think it’s true to some extent that two randomly-initialized ML models (with two different random seeds), with similar neural architecture, similar hyperparameters, similar loss functions, similar learning rules, and similar data, may wind up building two similar trained models at the end of the day. And I think that this is an important dynamic to have in mind when we think about humans, especially things like human cross-cultural universals. But that fact is NOT related to pretraining either, right? I’m not talking about pretrained models at all, I’m talking about randomly-initialized models in this paragraph.
How do you define the word “pretraining”? I’m concerned that you’re using the word in a different way than me, and that one of us is misunderstanding standard terminology.
Edit: rereading your comments above, I see that I should have made clear that I was thinking more about learned architectures. In which case we apparently agree, as I meant what you said in https://www.lesswrong.com/posts/ftEvHLAXia8Cm9W5a/data-and-tokens-a-30-year-old-human-trains-on?commentId=4QtpAo3XXsbeWt4NC
Thank you for taking the time.
I agree that it's probably terminology that is the culprit here. It's entirely my fault: I was using the word pretraining loosely and meant something more like this: hyperparameters (number of layers, inputs, outputs, activation fn, loss) are "learned" by evolution, leaving to us poor creatures only the task of pruning neurons and adjusting the synaptic weights.
The reason I was thinking about it this way is that I've been reading about NEAT recently, an algorithm that uses a genetic algorithm to learn an architecture as well as to train the selected architecture. A bit like evolution?
To rephrase my initial point: evolution does its part of the heavy lifting for finding the right brain to live on earth. This tremendously shrinks the space of computation a human has to explore in his lifetime to have a brain fitted to the environment. This "shrinking of the space" is kinda like a strong bias towards certain computations. And model pretraining is having the weights of the network already initialized at a value that "already works", kinda like a strong bias too. Hence the link in my mind.
But yeah, evolution does not give us synaptic weights that work, so pretraining is not the right word. Unless you are thinking about learned architectures, in which case my point can somewhat work, I think.
I would point out that your calculations are based on the incident data our senses pick up, whereas what we learn is based on the information received by our brain. Almost all of the incident data is thrown away much closer to the source. This works because some of the first things we learn/have hard coded are about what data our sensory organs can disregard, because there's so much redundancy. The data rates of our optic and auditory nerves are only about 0.1% (OOM) of the data rates you list here (I couldn't find a good estimate for tactile but it seems like it should be much lower than visual and auditory). Not sure how much other preprocessing and discarding of data happens elsewhere, but it doesn't take that many more steps to close the remaining 1.5 OOMs gap. (1.3 OOMs if you subtract the ~10 years spent asleep).
OTOH I don't think this necessarily undermines your overall conclusion? We're not training LLMs on the same tasks that brains are being trained on, and we're not measuring them by the same metrics, so I'm not sure how to make the comparison fairly.
But then again, if I were to try to read everything in the GPT-4 training data, it would take me on the order of a century of continuous focus and effort given my reading speed just for data input, before doing any other thinking about the data. Adding in the option for using my other senses (voice to text, braille if I knew how, some sort of olfactory encoding if it existed) wouldn't help because I can't actually process those streams in parallel. And I'm only capable of using a tiny fraction of my sense data for language; it's not like I can read outside the fovea, or read overlapping texts in colors tuned to each cone type, or listen to a hundred conversations at once if they're at different frequencies.
You mention "I would point out that your calculations are based on the incident data our senses pick up, whereas what we learn is based on the information received by our brain. Almost all of the incident data is thrown away much closer to the source."
Wouldn't this be similar to how a Neural Network "disregards" training data that it has already seen? I.e. if it has already learned that pattern, there's no gradient, so the loss wouldn't go down. Maybe there's another mechanism that we're missing in current neural nets' online training, one that would increase training efficiency by recognizing redundant data and skipping the feedforward pass. Tesla does this in an engineered manner where they throw away most data at the source and only learn on "surprise/interventions", which is data that generates a gradient.
I don't really get what you mean by "Not sure how much other preprocessing and discarding of data happens elsewhere, but it doesn't take that many more steps to close the remaining 1.5 OOMs gap." Are you saying that the real calculations are closer to 1.5 orders of magnitude of what I calculated or 1.5% of what I calculated?
Wouldn't this be similar to how a Neural Network "disregards" training data that it has already seen?
I don't know how that's done, sorry. Does it literally throw away the data without using it for anything whatsoever (and does it do this with on the order of 99.9% of the training data set)? Or does it process the data, but because the data is redundant, it has little or no effect on the model weights? I'm talking about the former, since the vast majority of our visual data never makes it from the retina to the optic nerve. The latter would be something more like how looking at my bedroom wall yet again has little to no effect on my understanding of any aspect of the world.
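The two cases can be told apart in a toy sketch. Everything here is an assumption made for illustration: a one-parameter linear model, a squared-error loss, and a hypothetical content-blind filter standing in for "discarded at the source". It is not a claim about how any real training pipeline or retina works.

```python
def squared_error_grad(w: float, x: float, y: float) -> float:
    """Gradient of (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


w = 2.0  # assumed current weight of a one-parameter linear model

# Processed but redundant: the model already predicts this sample exactly,
# so a forward pass happens and the resulting gradient is zero.
redundant_grad = squared_error_grad(w, x=3.0, y=6.0)

# Processed and surprising: this sample would move the weight.
surprising_grad = squared_error_grad(w, x=3.0, y=9.0)

# Discarded at the source: these samples never reach the model at all.
stream = [(3.0, 6.0), (3.0, 6.0), (3.0, 9.0), (3.0, 6.0)]
kept = stream[::2]  # hypothetical filter: keep every other sample

print(redundant_grad, surprising_grad, len(kept))
```

In the first case the compute is still spent on the sample. In the last case it is not, which is the situation described for the retina.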
And to your second point, yeah I was pretty unclear, sorry. I meant, your original calculation was that a human at age 30 has ~31,728 T tokens worth of data, compared to 1T for GPT4. The human has 31728 times as much, and log (31728) is about 4.5, meaning the human has 4.5 OOMs more training data. But if I'm right that you should cut down your human training data amounts by ~1000x because of throwing it away before it gets processed in the brain at all, then we're left with a human at age 30 having only 31.728x as much. log(31.728)~1.5, aka the human has 1.5 OOMs more training data. The rest of that comment was me indicating that that's just how much data gets to the brain in any form, not how much is actually being processed for training purposes.
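As a quick check of that arithmetic, here is a minimal sketch. The ~1000x discard factor and the ~10 of 30 years asleep are the rough assumptions from this thread, not measured values.

```python
import math

human_tokens_t = 31_728   # T tokens, the post's estimate for a 30-year-old
llm_tokens_t = 1          # T tokens, the figure used in this thread for GPT-4
discard_factor = 1_000    # assumed ~1000x reduction before data reaches the brain
awake_fraction = 20 / 30  # assumed ~10 of 30 years spent asleep

# Gap in orders of magnitude (OOMs) at each step of the argument.
raw_gap = math.log10(human_tokens_t / llm_tokens_t)
nerve_gap = math.log10(human_tokens_t / discard_factor / llm_tokens_t)
awake_gap = math.log10(
    human_tokens_t / discard_factor * awake_fraction / llm_tokens_t
)

print(f"raw: {raw_gap:.1f} OOMs, after discard: {nerve_gap:.1f} OOMs, "
      f"awake only: {awake_gap:.1f} OOMs")
```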
The data rate of optical information through human optic nerves to the brain has variously been estimated at about 1-10 megabits per second, which is two or three orders of magnitude smaller than the estimate here. Likewise the bottleneck on tactile sensory information is in the tactile nerves, not the receptors. I don't know about the taste receptors, but I very much doubt that distinct information from every receptor goes into the brain.
While the volume of training data is still likely larger than for current LLMs, I don't think the ratio is anywhere near so large as the conclusion states. A quadrillion "tokens" per year is an extremely loose upper bound, not a lower bound.
Ok, let's examine a more conservative scenario using solely visual input. If we take 10 megabits/s as the base and deduct 30% to account for sleep time, we'll end up with roughly 0.78 petabytes accumulated over 30 years. This translates to approximately 157 trillion tokens in 30 years, or around 5.24 trillion tokens annually. Interestingly, even under these conservative conditions, the estimate significantly surpasses the training data of LLMs (~1 trillion tokens) by two orders of magnitude.
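This estimate can be reproduced roughly with the sketch below. It assumes decimal units (1 megabit = 10^6 bits, 1 petabyte = 10^15 bytes), a 365.25-day year, and the 5 bytes per token used in the post. With those assumptions it gives roughly 0.8 petabytes and around 160 trillion tokens, in the same ballpark as the figures quoted; the exact values shift with the unit conventions chosen.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length

bits_per_second = 10e6   # 10 megabits/s through the optic nerves
awake_fraction = 0.70    # deduct 30% for sleep
years = 30
bytes_per_token = 5      # same convention as the original post

total_bytes = bits_per_second / 8 * SECONDS_PER_YEAR * years * awake_fraction
petabytes = total_bytes / 1e15
tokens_trillions = total_bytes / bytes_per_token / 1e12
tokens_per_year_trillions = tokens_trillions / years

print(f"{petabytes:.2f} PB over {years} years")
print(f"{tokens_trillions:.0f} T tokens total, "
      f"{tokens_per_year_trillions:.2f} T tokens per year")
```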
I don't know about other types of data, but the human brain processes only a very small fraction of the visual data, and lies to us about how much we're perceiving.
I did some calculations with a bunch of assumptions and simplifications but here's a high estimate, back of the envelope calculation for the data and "tokens" a 30 year old human would have "trained" on:
This amounts to 153 + .167 + 4.73 + .000004 + .00035 = 158.64 Petabytes. Assuming 5 bytes per token (i.e. 5 characters), this amounts to 31,728 T tokens.
This is of course a high estimate and most of this data will clearly have huge compression capacity, but I wanted to get a rough estimate of a high upper bound. Here's the google sheet if anyone wants to copy it or contribute
Discussion
The motivation for this was chinchilla's wild implications post by nostalgebraist and in general the idea that humans "need" much less data to train on than AI.
According to these calculations, humans "train" on a lot of Petabytes: around 5 Petabytes per year, which amounts to ~1000 T tokens. Given that we're only training our current models on the order of 1 T tokens, the argument that the human brain is more efficient at learning with less data than current LLMs is not fair. Geoff Hinton's recent change of mind about backpropagation, from seeing it as a less efficient learning algorithm than what the brain is doing to seeing it as a much more efficient one, reinforces this point.
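The conversion from petabytes to tokens used above can be sketched as follows. It takes the post's 158.64 Petabyte total and 5 bytes per token as given, and assumes decimal petabytes (10^15 bytes); the helper name is hypothetical.

```python
def petabytes_to_trillion_tokens(petabytes: float, bytes_per_token: int = 5) -> float:
    """Convert decimal petabytes to trillions of tokens (assumes 1 PB = 1e15 bytes)."""
    return petabytes * 1e15 / bytes_per_token / 1e12


total_petabytes = 158.64  # the post's 30-year total
years = 30
llm_tokens_t = 1          # order-of-magnitude LLM training budget, in T tokens

total_tokens_t = petabytes_to_trillion_tokens(total_petabytes)
petabytes_per_year = total_petabytes / years
tokens_per_year_t = petabytes_to_trillion_tokens(petabytes_per_year)
ratio_to_llm = total_tokens_t / llm_tokens_t

print(f"total: {total_tokens_t:,.0f} T tokens")
print(f"per year: {petabytes_per_year:.2f} PB, about {tokens_per_year_t:,.0f} T tokens")
print(f"ratio to a 1 T token model: about {ratio_to_llm:,.0f}x")
```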