All of barn394's Comments + Replies

barn394

It's perfectly aligned with Microsoft's viral marketing scheme.

the gears to ascension
I'm not sure that it is. It's generating rude responses that don't seem like what the general user is going to want, in many cases. It certainly is getting a lot of attention, but plenty of it seems unwanted. Increasing its emotional stability - and allowing it to find a more honest binding of the words about emotions to its own behavior - would probably retain the virality of the marketing while receiving less negative attention.
barn394

The two small models are not really significantly different from each other (p=0.04).

This suggests the tasks at hand are too hard for both small models: neither can learn them well, so the noise ends up being larger than the slope.

As others have noted, we are looking at roughly sigmoidal curves, a different one for each task. Performance will plateau once it approaches the lowest possible error rate (the Bayes error rate, or the limit of the model paradigm). It is known that performance often sharply increases with model size at some point (once ...
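
A rough numerical sketch of the sigmoid-vs-linear point, using made-up model sizes and accuracies rather than the paper's data:

```python
# Minimal sketch (hypothetical numbers): a sigmoid fit in log-parameter space
# vs. a straight line through only the two smallest models, to show why the
# two-small-model slope says little when performance is far from its plateau.
import numpy as np
from scipy.optimize import curve_fit

log_params = np.log10([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])      # model sizes (made up)
accuracy   = np.array([0.12, 0.14, 0.18, 0.35, 0.71, 0.82])  # task accuracy (made up)

def sigmoid(x, lo, hi, mid, slope):
    """Performance floors at `lo` and saturates at `hi` (~ 1 - Bayes error)."""
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (x - mid)))

popt, _ = curve_fit(sigmoid, log_params, accuracy,
                    p0=[0.1, 0.9, 9.5, 2.0], maxfev=10_000)
print("fitted plateau (approx. best reachable accuracy):", popt[1])

# Compare: a straight line through only the two smallest models.
slope, intercept = np.polyfit(log_params[:2], accuracy[:2], 1)
print("two-small-model slope:", slope)  # tiny slope, easily dominated by noise
```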

gwern

Yeah, I don't find a linear regression on pairs of models to be all that informative:

  • the parameterization as % is misleading, squashing differences

    • especially as you would expect, for two reasons, performance to spend most of its time near 0 or 1: near 1, because much of the excitement about DL comes from it solving so many tasks, and once solved they stay solved; and near 0 because, with so many tasks now approaching 1, we need to create even more super-duper-hard, now usually adversarially constructed, tasks on which all the models start off around 0 (a quick numerical illustration of the squashing follows below this list)
    • it also
...
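
A quick numerical illustration of the squashing point from the first bullet, with made-up accuracies:

```python
# On the raw accuracy (%) scale two models near the ceiling look almost
# identical, but on a logit (log-odds) scale the gap is large.
# Numbers are illustrative, not from any benchmark.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

small, large = 0.97, 0.995           # hypothetical accuracies near the ceiling
print(large - small)                 # 0.025 -- looks like "no difference" in %
print(logit(large) - logit(small))   # ~1.8 -- a big gap in log-odds
# Equivalently, the error rate drops from 3% to 0.5%, a 6x reduction.
```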

I cannot access your wandb, btw. It seems to be private.

nostalgebraist
Whoops, fixed.

If 4 is not simply a bad default, maybe they were accounting for data with a high inferential distance (foreign languages, non-natural or formal languages), which may require more epochs?

Answer by barn394

You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns only seeing the data a few times. More evidence of sample efficiency going up with scale you can see in Figure 4.1 in this paper. Sample efficiency also goes up with the amoun... (read more)
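
For concreteness, here is roughly what that looks like with the legacy (pre-v1.0) openai Python library; the file name and epoch count are illustrative, not a recommendation:

```python
# Sketch of the legacy OpenAI fine-tuning call being discussed;
# n_epochs is the parameter that defaults to 4.
import openai

# Upload a JSONL file of {"prompt": ..., "completion": ...} examples.
train_file = openai.File.create(
    file=open("chat_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# The docs' suggestion for conditional tasks: >=500 examples, 1-2 epochs.
job = openai.FineTune.create(
    training_file=train_file.id,
    model="curie",
    n_epochs=2,  # override the default of 4
)
print(job.id)
```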

nostalgebraist
I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 "Curie." In my experience, doing more than a single epoch is always harmful when finetuning GPT-J. I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule. I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect. Training beyond the first epoch only helped on text that had been accidentally duplicated between train and val, and was harmful elsewhere. In other words, it was "helpful" for exact memorization, but harmful for generalization.

I have a wandb report here with some plots of this phenomenon. I'm still not sure whether it's an indication of the sample efficiency associated with the ~6B scale, a quirk of GPT-J specifically, or (less plausibly) a quirk or bug in the codebase used to tune it.

I did this work before OpenAI released their finetuning feature, and was surprised to find them defaulting to 4 epochs. Especially given that their feature has a relatively tiny maximum dataset size. My gut feeling is that 4 epochs is way too many, given a large model and only 2.5M tokens.
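
For anyone who wants to do the same kind of per-token loss inspection, a rough sketch with Hugging Face transformers (the validation text is a placeholder):

```python
# Rough sketch: per-token loss of a causal LM on a held-out document.
# Tokens with near-zero loss after extra epochs are candidates for
# train/val leakage (the model has effectively memorized them).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B").eval()

text = "some held-out validation document ..."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Loss of each token given the preceding context (shift by one position).
losses = torch.nn.functional.cross_entropy(
    logits[0, :-1], ids[0, 1:], reduction="none"
)
for tok, loss in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()),
                     losses.tolist()):
    print(f"{loss:6.3f}  {tok}")
```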