The most relevant paper I know of comes out of data privacy concerns. See Extracting Training Data from Large Language Models, which defines "k-eidetic memorization" as a string that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly in the source documents. Unfortunately their methodology is targeted towards high-entropy strings and so is not universal.
I have a related question I've been trying to operationalize. How well do GPT-3's memories "generalize"? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 "gain information" from that fact?
E.g. training: "Ixlthubs live in the water." Test: does this affect the predicted likelihood of "Ixlthubs live in the Pacific"? What about "Ixlthubs cannot survive on land"? I'd consider this another interesting measure of sample efficiency/generalization performance. I'm attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it's my first try at something like this and slow going.
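To make the measurement concrete, here's a rough sketch of the scoring side, using a Hugging Face causal LM to compare log-likelihoods of probe sentences. The model name and probe strings are placeholders; the real test would compare checkpoints trained with vs. without the injected synthetic facts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the real comparison would be between checkpoints
# trained with vs. without the injected synthetic facts.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text` (given its first token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted tokens,
    # i.e. over len(input_ids) - 1 positions after the internal shift.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted

probes = [
    "Ixlthubs live in the Pacific.",     # close to the injected fact
    "Ixlthubs cannot survive on land.",  # requires a small inference step
    "Ixlthubs live in the desert.",      # control: should not get more likely
]
for p in probes:
    print(f"{sentence_logprob(p):8.2f}  {p}")
```

If the fact generalizes, the first two probes should gain log-probability relative to the control after training on "Ixlthubs live in the water."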
This is great, thanks! Then I wonder what people mean, exactly, when they say current methods are sample-inefficient. k=1 memorization seems to be about as good as humans, and this with tiny artificial neural nets! (Even GPT-3 is a thousand times smaller than a human brain).
Your question is super interesting as well. If you make progress on answering it, I'd love to hear!
First pass at trying to answer:
I'm asking GPT-3 questions of the form "Who is X?" to see what it knows. It knows EY, Paul, Katja, Julia, Wei Dai, Kaj Sotala... It thinks Daniel Kokotajlo is a filmmaker, which is true actually (there are two of us in the world, and the more well-known one is the filmmaker). It thinks Evan Hubinger is a software engineer.
In parallel I'm googling those names in quotes to see how many hits they get. To my surprise, there are about ten thousand hits for many of these names; the more popular ones get more. But GPT-3's training data didn't contain the whole internet, right? Just a fraction of it? So presumably it had only a thousand, or a hundred, instances of each name to learn from?
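In case anyone wants to replicate the querying side, it's roughly just this with the pre-1.0 `openai` Python client (the engine choice and sampling settings here are illustrative, not necessarily what I used):

```python
import openai  # pre-1.0 openai client

openai.api_key = "sk-..."  # your key here

names = ["Wei Dai", "Kaj Sotala", "Daniel Kokotajlo", "Evan Hubinger"]

for name in names:
    resp = openai.Completion.create(
        engine="davinci",            # base GPT-3, no instruction tuning
        prompt=f"Who is {name}?\n",
        max_tokens=60,
        temperature=0.0,             # make the answer (mostly) deterministic
    )
    print(name, "->", resp["choices"][0]["text"].strip())
```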
Slight subtlety - GPT-3 might have a bias in its training data towards things related to AI and things of interest to the internet (maybe they scraped a lot of forums as well as just google). I picked some random names from non-western countries - for example, this Estonian politician gets 33,000 hits on Google and wasn't recognised by GPT-3. It thought he was a software developer (though from Estonia). Might mean that if you're estimating sample efficiency from Google search hits on people involved with AI, you'll end up overestimating sample efficiency.
What did it say about me? :D I think I tried asking the AI Dungeon version about me at some point but apparently the adventure game finetuning had made that knowledge inaccessible.
I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3's total sample efficiency for predicting text is poor:
But on-the-margin, it's very sample efficient at learning to perform new text-related tasks:
Essentially, what's happened is GPT-3 is a kind-of mega-analytical-engine that was really sample inefficient to train up to its current level, but that can now be trained to do additional stuff at relatively little extra cost.
Does that resolve the sense of confusion/mystery, or is there more to it that I'm missing?
That does help, thanks. However, now that I understand better what people are saying, I think it's wrong:
The comparison they are making is as follows:
| GPT-3 | Human |
| --- | --- |
| Pre-trained on 3x10^11 tokens of text | Pre-trained on 3x10^8 tokens of text (Fermi estimate based on a ~300 WPM reading speed, so maybe 500 tokens per minute, 10 hours of reading per week, 52 weeks a year, over 20 years of life; arithmetic spelled out below) |
| Able to read a new fact once or twice and then learn it / remember it | Able to read a new fact once or twice and then learn it / remember it |
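Spelling out the Fermi estimate in the table above:

```python
# Fermi estimate for the "human pre-training" cell above.
tokens_per_minute = 500          # ~300 words/min reading speed, ~1.3 tokens/word
minutes_per_week = 10 * 60       # 10 hours of reading per week
weeks_per_year = 52
years = 20

human_tokens = tokens_per_minute * minutes_per_week * weeks_per_year * years
print(f"{human_tokens:.1e}")     # ~3.1e+08, i.e. ~3x10^8 tokens

gpt3_tokens = 3e11
print(f"{gpt3_tokens / human_tokens:.0f}x")  # GPT-3 saw roughly 1000x more text
```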
However, I think this is a bad comparison, because it igno...
Perhaps GPT-3 has more parameters than are needed to roughly memorize its very large training data. That would be a good thing, since the data contains some low-quality garbage, false claims, etc. (you can think of these as 'noise'). I believe the GPT-n series is adding parameters faster than training data. Here's my summary of a paper that suggests this is the right move:
https://www.youtube.com/watch?v=OzGguadEHOU Microsoft's Sebastian Bubeck talking about how seemingly overparameterized neural models are necessary for learning (due to label noise?). Validation 'early stopping' of training duration or size scaling is a mistake: after you're over some initial hump that would trigger validation early stopping, overfitting is 'benign' [already known, dubbed 'double descent']. As soon as you can defeat adversarial attacks, you're probably using enough parameters. He (plus an intern) proves that in order to perfectly memorize the label-noised data set such that small perturbations of the input don't change the predicted output, you need a parameter count much larger than the data set (perfectly memorizing the training data alone should be possible within some constant factor of its size). He predicts that ImageNet (an image labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.
(Obviously the GPT-n models are language models, but they can be thought of as having an output which is the masked word, or the sentence-before-or-after, or whatever they're using to train.)
You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epochs parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns after seeing the data only a few times. You can see more evidence of sample efficiency going up with scale in Figure 4.1 of this paper. Sample efficiency also goes up with the amount of data already seen (pre-training).
This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. Then there is no need for "long-range" tricks (RNNs, CNNs, attention) anymore. Instead, one can one-shot learn by backprop while doing the predictions within a relatively short time window.
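For reference, the setup those docs describe would look roughly like this with the legacy fine-tunes endpoint of the pre-1.0 `openai` client (the file name is a placeholder, and the exact parameter names are from memory, so treat them as an assumption):

```python
import openai  # pre-1.0 client, legacy fine-tunes endpoint

openai.api_key = "sk-..."  # your key here

# ~500+ prompt/completion pairs in JSONL, per the docs' recommendation.
uploaded = openai.File.create(
    file=open("chat_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 1-2 epochs instead of the default 4 for conversational data.
openai.FineTune.create(
    training_file=uploaded["id"],
    model="curie",
    n_epochs=2,
)
```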
I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 "Curie."
In my experience, doing more than a single epoch is always harmful when finetuning GPT-J.
I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule. I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect. Training beyond the first epoch only helped on text ...
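For anyone who wants to do the same kind of inspection, here's a minimal sketch of per-token loss with HF transformers (GPT-2 as a small stand-in; swap in your own GPT-J finetune checkpoint):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small stand-in; swap in your GPT-J finetune checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def per_token_loss(text: str):
    """(token, loss) for every predicted token in `text`."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    # Shift: position i predicts token i+1.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, losses.tolist()))

for tok, loss in per_token_loss("Some held-out validation text goes here."):
    print(f"{loss:6.3f}  {tok!r}")
```

Abnormally low losses on supposedly held-out spans are the tell that the train/val split leaked.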
In general a language model will 'know' the sentence related to the single occurrence of a rare name. I don't think you learn much here if there are enough parameters available to support this memory.
The sample efficiency claim is not a formal one. For example, RL algorithms are said to be sample-inefficient because it takes a human only about 10 games of Pacman to get good at it, but we can't isolate that knowledge in the human brain. By the time a human learns to play Pacman, they have already learned many things, just like GPT-3, and we don't know which of those things contribute to playing Pacman: is it motor skills? Spatial skills? A fair comparison of sample efficiency would require knowing all the skills that enable a human to play Pacman in only ten games, giving those to the RL algorithm as pre-training, and then training it to play Pacman. The same applies to the names example: could we really measure how many times a human has heard a name, or a similar name?
My guess is that the issue of sample efficiency results from equivocation between datasets used for training a model and datasets provided externally. What is the sample efficiency of AlphaZero? It's as bad as anything else if we divide by the datasets generated by amplification, but it's infinitely large if we divide by externally provided datasets, as there are none. The sample efficiency relevant for the cost of training includes the datasets generated by amplification, but in informal comparison with human performance the estimate is about how much the humans observed externally before attaining some level of performance, hence the equivocation.
Similarly, if someone figures out amplification for language models (something like debate, but that actually works), the model can then train on the vastly larger (and better) datasets generated by the model itself, bootstrapped from only the external dataset, and so its sample efficiency with respect to the external dataset is going to skyrocket. (One issue is that the external dataset is already large, so it's more about quality than quantity; but alternatively this form of training might be able to bootstrap from a much smaller external dataset.) So the usual measure of sample efficiency doesn't seem very informative about what's possible with exactly the same learning algorithm after the amplification loop is closed.
For context, I'm interested in questions like "If we had a big transformer that was being fine-tuned as a chatbot with millions of daily conversations, would it be up-to-date on the latest news of the day? What about local news? What about e.g. subculture drama? How often would people have to talk about something for it to be impressed in long-term memory?"
This sounds like something an appropriate amplification may well be able to help the model memorize, even for things mentioned only once, without changing the learning algorithm, at the cost of more training on the auxiliary data generated by the amplification (in this case probably with prompts from the external datasets that need to be combed for rare details).
(I do understand that the question you are asking is about what happens without auxiliary data. I'm commenting on a way accounting for prompt engineering breaks estimates of potential performance of the same learning algorithm. It then becomes an issue of cost, not limitations of the algorithm, in a way that's different from scaling laws.)
Right on. That's a good point. So really I guess the conclusion is: Compute is the bottleneck; an AI chatbot or whatever could totally learn random facts the very first time it encounters them, if you had things set up to amplify that data into some auxiliary dataset and then train on it. Costs a few orders of magnitude more compute perhaps, but gets the job done. Right? (And this could be automated & "smart" in the sense that the AI could decide what stuff to memorize/internalize, what stuff to forget, and what stuff to add to some software database.)
Right. Of course if the sample efficiency of learning improves, the cost goes down, but that's not really crucial for anything. The learning part of AGI is already essentially solved, it just needs to be put into a place where it's getting fed the right data.
"For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community" - Can you say more about this?
Oliver Habryka once told me that he uses GPT-3 to help him create invite lists for events. E.g. "The LessWrong community organized a celebration of Petrov Day. They invited all the prominent Rationalist/EA-adjacent people in the Bay area. Here is the list of people they invited: [Insert list of people he's thought of so far] [GPT-3 continues the list]" Habryka said it's helped him avoid accidentally forgetting people.
If anyone wants to try this with the Pile, you can download a copy of the Pile here and try GPT-J (6B parameters, a lot less than GPT-3's 175B) here (hosted) or through HF transformers (locally). If you run into any problems, you can DM me or ask on the EleutherAI discord.
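Here's a rough sketch of the counting side, assuming you've unpacked the Pile's zstd-compressed jsonlines shards somewhere (the path and the name being counted are placeholders):

```python
import glob
import io
import json

import zstandard as zstd  # pip install zstandard

# Placeholder path: wherever you put the Pile's train shards.
# Each shard is zstd-compressed jsonlines with a "text" field per document.
shards = glob.glob("pile/train/*.jsonl.zst")
name = "Paul Christiano"

occurrences, documents = 0, 0
for shard in shards:
    with open(shard, "rb") as f:
        # Large decompression window; the Pile shards need this.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(f)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            hits = json.loads(line)["text"].count(name)
            occurrences += hits
            documents += hits > 0

print(f"{name!r}: {occurrences} occurrences across {documents} documents")
```

Pair the count with a "Who is X?" prompt to GPT-J and you have the two numbers the question asks for.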
(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but still compared to humans it takes a lot.
Elsewhere, based on papers like this and this, various people have extrapolated the following takes:
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")
I feel like this take is in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantify how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at EleutherAI have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.
*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.