Fine-tuned Model: Fine-tune GPT-3 with supervised learning
Reward Model: Make a copy of the FT Model, remove the last layer and add a linear layer, train this as a reward model.
Policy: Make a copy of FT Model, train this with RL using the Reward Model.

2. and 3. can be alternated and the Policy from 3. is what is used as the final InstructGPT

My question: I wonder which of these stages contributes "the most" to the observed mode collapses/behavior.

If skip I 1. and just use GPT-3 without fine-tuning, how does this impact mode collapse?
What if I skip 2. and 3. and just use 1.?
What if I

... (read more)

3

0