(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but it still takes a lot compared to humans.
Elsewhere, based on papers like this and this, various people have extrapolated the following takes:
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")
I feel like these takes are in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantify how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at EleutherAI have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
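For concreteness, here's roughly the experiment I have in mind, as a minimal Python sketch. Assumptions: a local copy of The Pile as jsonlines shards with a "text" field (its on-disk format), a made-up shard path, and EleutherAI's GPT-Neo standing in for "a big transformer trained on The Pile":

```python
import json
from transformers import pipeline

def count_mentions(shard_paths, phrase):
    """Step 1: count how many times `phrase` appears in the training corpus."""
    total = 0
    for path in shard_paths:
        with open(path) as f:
            for line in f:
                total += json.loads(line)["text"].count(phrase)
    return total

# "pile/00.jsonl" is a hypothetical path to one Pile shard.
n = count_mentions(["pile/00.jsonl"], "Paul Christiano")

# Step 2: probe whether a Pile-trained model "knows" the fact,
# e.g. by checking a greedy completion of a leading prompt.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("The alignment researcher Paul", max_new_tokens=5, do_sample=False)
print(n, out[0]["generated_text"])
```

Repeat over facts of varying corpus frequency and look for the frequency below which the model stops completing them correctly.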
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.
*Suppose it is ten thousand. Then that means one in every thirty million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens, and 300B / 10,000 = 30 million.) Add in all the other rationalists/EAs (a few hundred names, say) and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.
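Spelling out the footnote arithmetic (the 300-names figure is my own rough guess, not a measurement):

```python
tokens_in_dataset = 300e9    # GPT-3 training set, ~300B tokens
mentions_per_name = 10_000   # hypothesized threshold for learning a fact
prominent_names = 300        # rough guess at prominent rationalists/EAs

# One "Paul Christiano" per ~30 million two-word strings:
print(tokens_in_dataset / mentions_per_name)                      # 3e7
# With all names combined, one name per ~100,000 words:
print(tokens_in_dataset / (mentions_per_name * prominent_names))  # 1e5
```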
Thanks!