(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but still compared to humans it takes a lot.
Elsewhere, based on papers like this and this, various people have extrapolated the following takes:
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")
I feel like this take is in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantity how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.
*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.
This paper estimates that the human retina conveys visual information to the rest of the brain at 1e7 bits/second. I haven't read the paper though. It's a bit tricky to compare that to pixels anyway, because I think the retina itself does some data compression. I guess we have 6 million cones, which would be ~2M of each type, so maybe vision-at-any-given-time is ballpark comparable to the information content in a 1 megapixel color image??