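The OpenAI style of RLHF progresses in 3 stages:
1. Fine-tune the pretrained model (GPT-3) with supervised learning on human-written demonstrations.
2. Train a reward model on human comparisons between model outputs.
3. Optimize the policy against that reward model with reinforcement learning (PPO).

Stages 2 and 3 can be alternated, and the policy from stage 3 is what is used as the final InstructGPT model.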
My question: which of these stages contributes "the most" to the observed mode collapse and related behavior?
- If I skip 1. and just use GPT-3 without fine-tuning, how does this affect mode collapse?
- What if I skip 2. and 3. and just use 1.?
- What if I
...
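A minimal sketch, not taken from the original post, of one way the ablations above could be compared: sample many completions for the same prompt from each model variant and compute the Shannon entropy of the empirical distribution over distinct completions, treating lower entropy as a rough signal of mode collapse. The variant names (`gpt3_base`, `sft_only`, `sft_rm_ppo`) and the sample strings are hypothetical placeholders, and matching completions by exact string is a simplifying assumption; real outputs would likely need normalization or clustering first.

```python
import math
from collections import Counter


def completion_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution over distinct completions.

    Assumption: two completions count as "the same mode" only if their strings match exactly.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)
    # Sum p * log2(1/p) over each distinct completion's empirical probability p.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical samples for a single prompt (illustrative placeholders, not real model outputs).
    variants = {
        "gpt3_base": ["A", "B", "C", "D"],
        "sft_only": ["A", "A", "B", "C"],
        "sft_rm_ppo": ["A", "A", "A", "A"],
    }
    for name, samples in variants.items():
        print(f"{name}: {completion_entropy(samples):.3f} bits")
```

In practice one would average this over many prompts and draw many more samples per prompt; with only four samples the maximum possible value is 2 bits.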