I have an impression that within-lifetime human learning is orders of magnitude more sample-efficient than large language models, but there are numerous caveats to this:

  1. We don't have "an ecological evaluation objective" for language models (they weren't actually optimised for the downstream language usage tasks on which we compare them to humans)
  2. Insofar as we do have an ecological evaluation objective (predictive loss on a held-out test set), language models are already very superhuman; apparently even GPT-1 was superhuman at next token prediction
    • Though for similar reasons, next token prediction is not an ecological training objective for humans
      • Humans that specialised at next token prediction (the way some humans specialise at chess) may show markedly different results
  3. It's plausible that most of the optimisation involved in producing the brain happened over the course of our evolutionary history, and that within-lifetime human learning is more analogous to fine-tuning than to training from scratch.

  #3 notwithstanding, I'm curious whether we have any robust estimates of how within-lifetime human learning compares to deep learning on sample efficiency across various tasks of interest.


Why Does This Matter?

The brain is known to be very energy efficient compared to GPUs of comparable processing power.

However, energy efficiency is a much less taut constraint for human engineering than it was for biology (electricity has a much higher throughput than ATP, and we have a much larger energy budget). This relative energy abundance would likely remain the case (or rather intensify) as AI systems become more capable.
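To make the "much larger energy budget" point concrete, here is a back-of-envelope sketch. The ~20 W figure for the brain is a commonly cited estimate; the GPU and data-center wattages are illustrative assumptions, not figures from this post:

```python
# Rough power comparison (all figures are approximate assumptions).
brain_watts = 20          # commonly cited estimate for the human brain
gpu_watts = 700           # typical TDP of a modern data-center GPU (assumed)

# Per device, the GPU draws far more power than a brain...
power_ratio = gpu_watts / brain_watts  # 35x

# ...but engineering can simply spend more energy: a single data center
# can draw tens of megawatts, dwarfing any biological energy budget.
datacenter_watts = 30e6   # assumed ~30 MW facility
brains_equivalent = datacenter_watts / brain_watts  # ~1.5 million brains
```

The point is not the exact numbers but the asymmetry: biology had to economise on watts, while human engineering mostly does not.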

Thus, the energy efficiency of the brain does not provide much evidence with respect to whether advanced AGI will be neuromorphic.

On the other hand, it seems very plausible that data efficiency is part and parcel of general intelligence; sufficiently powerful systems may necessarily be more data-efficient than the brain.

If deep learning is sufficiently less data efficient than the brain, it may provide evidence that deep learning wouldn't produce existentially dangerous systems.

We may thus have reason not to expect deep learning to scale to superhuman general intelligence.


4 Answers

Darklight


The average human lifespan is about 70 years, or approximately 2.2 billion seconds. The average human brain contains about 86 billion neurons and roughly 100 trillion synaptic connections. In comparison, something like GPT-3 has 175 billion parameters and 500 billion tokens of data. Assuming, very crudely, weight/synapse and token/second-of-experience equivalence, we can see that the human model's ratio of parameters to data is much greater than GPT-3's, to the point that humans have significantly more parameters than timesteps (100 trillion to 2.2 billion), while GPT-3 has significantly fewer parameters than timesteps (175 billion to 500 billion). Granted, the information gain per timestep is different for the two models, but as I said, these are crude approximations meant to convey the ballpark relative difference.
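The ratios above can be sketched numerically, using the same crude figures quoted in the answer:

```python
# Crude parameter-to-data ratios, using the rough figures quoted above.
human_params = 100e12      # ~100 trillion synapses, treated as "parameters"
human_timesteps = 2.2e9    # ~70 years of life in seconds
gpt3_params = 175e9        # GPT-3 parameter count
gpt3_tokens = 500e9        # GPT-3 training tokens

human_ratio = human_params / human_timesteps  # ~45,000 parameters per second of experience
gpt3_ratio = gpt3_params / gpt3_tokens        # ~0.35 parameters per training token
```

By this crude accounting the human "model" has roughly five orders of magnitude more parameters per unit of data than GPT-3 does.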

This basically means that humans are much more prone to overfitting the data, and in particular to memorizing individual data points; hence humans' episodic memory of unique events. It's not clear that GPT-3 has the capacity in terms of parameters to memorize its training data with that level of clarity, and arguably this is why such models seem less sample efficient. A human can learn from a single example by memorizing it and retrieving it later when relevant. GPT-3 has to see something enough times in the training data for SGD to update the weights sufficiently that the general concept is embedded in the highly compressed information model.

It's thus not certain whether existing ML models are sample-inefficient because of the algorithms being used, or because they just don't have enough parameters yet and increased efficiency will emerge from scaling further.

Bruce G


I have an impression that within-lifetime human learning is orders of magnitude more sample-efficient than large language models

Yes, I think this is clearly true, at least with respect to the number of word tokens a human must be exposed to in order to obtain full understanding of one's first language.

Suppose for the sake of argument that someone encounters (through either hearing or reading) 50,000 words per day on average, starting from birth, and that it takes 6000 days (so about 16 years and 5 months) to obtain full adult-level linguistic competence (I can see an argument that full linguistic competence happens years before this, but I don't think you could really argue that it happens much after this).

This would mean that the person encounters a total of 300,000,000 words in the course of gaining full language understanding.  By contrast, the training data numbers I have seen for LLMs are typically in the hundreds of billions of tokens.
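The arithmetic, with a hundreds-of-billions LLM token count filled in as a stand-in (the 300B figure is an assumption chosen for illustration, not a specific model's number):

```python
# Human word exposure, per the assumptions above.
words_per_day = 50_000
days_to_competence = 6_000                         # ~16 years 5 months
human_words = words_per_day * days_to_competence   # 300 million words

# LLM training data: "hundreds of billions of tokens"; 300B assumed.
llm_tokens = 300e9
ratio = llm_tokens / human_words                   # LLM sees ~1,000x more
```

Even with generous assumptions about human exposure, the LLM sees roughly three orders of magnitude more tokens on the way to linguistic competence.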

And I think there is evidence that humans can obtain linguistic fluency with exposure to far fewer words/tokens than this.

Children born deaf, for example, can only be exposed to a sign-language token when they are looking at the person making the sign, and thus probably get exposure to fewer tokens by default than hearing children who can overhear a conversation somewhere else, but they can still become fluent in sign language.

Even just considering people whose parents did not talk much and who didn't go to school or learn to read, they are almost always able to acquire linguistic competence (except in cases of extreme deprivation).

whereas older humans are 'trained' on perhaps 10B words (about 5 per second for 64 years)

I think this is an obviously wrong assumption about the training data for within-lifetime human learning.

I think it's likely off by orders of magnitude:

  • Relevant time frame is childhood
  • Text data consumption seems like a more relevant metric
  • Children do not read at a rate of 5 words per second
  • Children do not read all their lives
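For what it's worth, the quoted "5 per second for 64 years" assumption does reproduce the ~10B figure, and the childhood-focused estimate from earlier in this answer comes out more than an order of magnitude smaller:

```python
# The quoted estimate: 5 words/sec, continuously, for 64 years.
secs_per_year = 60 * 60 * 24 * 365
quoted_estimate = 5 * secs_per_year * 64   # ~10.1 billion words

# The childhood-focused estimate from above: 50,000 words/day for 6,000 days.
childhood_estimate = 50_000 * 6_000        # 300 million words

gap = quoted_estimate / childhood_estimate # ~34x
```

So the two estimates disagree by a factor of roughly 34, with the childhood-focused one arguably being the relevant denominator for sample-efficiency comparisons.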

Vladimir_Nesov


LLMs solve the wrong problem (token prediction) very well. As a side effect, they solve the right problem (talking reasonably) at a decent level. Humans are more aligned with solving the right problem, and they are abominable at token prediction. It's unsurprising that it takes superhuman performance and a ridiculous amount of data aimed at the wrong problem to incidentally end up OK at the right problem, and that when actually solving the right problem (as humans do), much less data would suffice.

What LLMs might manage at some point is to leverage their understanding of human concepts to retarget the learning process a little closer to the right problem, something a bit more reasonable as a human-related inductive bias than bare token prediction. This might bring sample efficiency with respect to the right problem up dramatically, and with it the resulting level of capability given the ridiculous amount of data.

(Though that's probably not the current AGI bottleneck. Which I previously thought it was, but no longer.)

1 comment

A very significant difference is that active learning (where you have a teacher correcting your mistakes, including on a meta-level, e.g. asking your teacher "what's the right way to say X?") is far more sample-efficient than passive learning (simply ingesting data without any feedback). This is even more so when passive learning includes only positive examples and no counterexamples. I don't have a good reference handy, but for some simple learning tasks (e.g. learning a regular language, AKA a finite automaton) I vaguely recall a theoretical result that asymptotic complexity is much lower with active learning. LLMs are taught in a far more passive way than humans.
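As a toy illustration of the active-vs-passive gap (this is not the regular-language result the comment alludes to, just a hypothetical threshold-learning example): to locate an unknown threshold among n points, a learner who can ask membership queries needs only about log2(n) samples via binary search, whereas a passive learner seeing random labelled examples needs on the order of n in the worst case.

```python
def active_learn_threshold(oracle, n):
    """Find the smallest x in [0, n) with oracle(x) == True, via binary search."""
    lo, hi = 0, n
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(mid):
            hi = mid        # threshold is at or below mid
        else:
            lo = mid + 1    # threshold is above mid
    return lo, queries

n = 1_000_000
secret = 637_241  # hypothetical unknown threshold
found, queries = active_learn_threshold(lambda x: x >= secret, n)
# ~20 actively chosen queries suffice, vs ~n passive examples worst-case
```

The exponential gap between choosing your own queries and passively receiving data is the same flavour of result the comment gestures at for automata learning.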