RLHF

Edited by Multicore et al.; last updated 2nd Oct 2024

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model's training signal comes from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal.
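In the common RLHF pipeline, humans compare pairs of model outputs for the same prompt, a reward model is fit to those preference judgments, and the base model is then fine-tuned with reinforcement learning (e.g. PPO) against the learned reward, usually with a penalty for drifting too far from the original model. The sketch below illustrates only the reward-model step under a Bradley-Terry style preference loss; the RewardModel class, tensor shapes, and random data are illustrative assumptions, not any particular implementation.

```python
# Minimal sketch of the reward-model step in RLHF (illustrative only).
# Humans pick the preferred of two responses; the reward model is trained
# so that its score for the preferred response exceeds the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected)
    # pushes the reward of the human-preferred response above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Tiny synthetic batch: random features stand in for encoded (prompt, response) pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
```

In practice the reward model is typically initialized from the language model itself and scores full token sequences, and the subsequent RL step optimizes the policy against this learned reward while penalizing divergence from the pre-RLHF model.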

Posts tagged RLHF
253 · Thoughts on the impact of RLHF research · paulfchristiano · 3y · 102 comments
29 · [Link] Why I’m excited about AI-assisted human feedback · janleike · 3y · 0 comments
120 · Compendium of problems with RLHF · Charbel-Raphaël · 3y · 16 comments
631 · The Waluigi Effect (mega-post) · Cleo Nardo · 3y · 188 comments
284 · Mysteries of mode collapse · janus · 3y · 57 comments
30 · Interpreting the Learning of Deceit · RogerDearnaley · 2y · 14 comments
108 · Trying to disambiguate different questions about whether RLHF is “good” · Buck · 3y · 47 comments
98 · [Link] Why I’m optimistic about OpenAI’s alignment approach · janleike · 3y · 15 comments
95 · RLHF does not appear to differentially cause mode-collapse · Arthur Conmy, beren · 2y · 9 comments
95 · On the functional self of LLMs · eggsyntax · 2mo · 35 comments
71 · MetaAI: less is less for alignment. · Cleo Nardo · 2y · 17 comments
71 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 3y · 8 comments
70 · The True Story of How GPT-2 Became Maximally Lewd · Writer, Jai · 2y · 7 comments
68 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · LawrenceC · 3y · 11 comments
66 · Towards Understanding Sycophancy in Language Models · Ethan Perez, mrinank_sharma, Meg, Tomek Korbak · 2y · 0 comments
(Showing 15 of 77 posts tagged RLHF.)