(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but compared to humans it still takes a lot.
Elsewhere, based on papers like this and this, various people have extrapolated the following takes:
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")
I feel like these takes are in tension with the common adage. I wonder: if a fact is mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantify how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
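For concreteness, here's a minimal sketch of the counting half of that experiment, assuming the corpus sits locally as JSONL files with a "text" field (the format The Pile is distributed in); the glob pattern and the target phrase are just placeholders. Checking whether the trained model actually "knows" the fact would be a separate prompting experiment.

```python
# Sketch: tally how often a given string appears in a local JSONL corpus.
# Assumes each line is a JSON object with a "text" field; adjust the glob
# and field name for however your copy of the corpus is stored.
import glob
import json

def count_mentions(corpus_glob: str, phrase: str) -> int:
    """Count case-sensitive occurrences of `phrase` across the corpus."""
    total = 0
    for path in glob.glob(corpus_glob):
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                total += doc.get("text", "").count(phrase)
    return total

if __name__ == "__main__":
    # Hypothetical path and phrase, purely for illustration.
    print(count_mentions("pile/*.jsonl", "Paul Christiano"))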
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.
*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.
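For reference, the rough arithmetic behind that estimate; the words-per-token ratio is my own assumption:

```python
# Back-of-the-envelope check of the footnote (assumed figures: ~300B
# training tokens, ~0.75 words per token, 10,000 mentions of one name).
tokens = 300e9
words = tokens * 0.75          # ~225B words
word_pairs = words / 2         # ~112B non-overlapping two-word strings
mentions = 10_000

print(word_pairs / mentions)   # ~1.1e7 -> roughly one pair in ten million
```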
"Sample efficiency" is not a formal claim. For example, RL algorithms are said to be sample-inefficient because it only takes a human about ten games of Pac-Man to get good at it, but we can't isolate that knowledge in the human brain. By the time a human learns to play Pac-Man, they have already learned many other things, just like GPT-3, and we don't know which of those things contribute to playing Pac-Man: motor skills? Spatial skills? Only if we knew all the skills that enable a human to learn Pac-Man in just ten games, gave them to the RL algorithm as pre-training, and then trained it to play Pac-Man would we have a fair comparison of how sample-efficient it is. The same applies to the names example: can we really measure how many times a human has heard a name, or a similar one?