This is a linkpost for https://openai.com/blog/instruction-following/#frog
I'll also link to a post that came out a little earlier, Mysteries of Mode Collapse due to RLHF, which is essentially a critique of the whole RLHF approach and of the Instruct models (specifically text-davinci-002).
OpenAI has given us a first taste of what success in outer alignment looks like, and that's InstructGPT. More specifically, InstructGPT is more aligned in two important respects: it is more truthful and it hallucinates less.
Specifically, OpenAI used RLHF to fine-tune GPT-3 into InstructGPT so that it would internalize human preferences. From the looks of it, this was partially successful: it reduced, but did not eliminate, hallucinations and untruthfulness.
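For readers unfamiliar with the mechanics, here is a minimal sketch of the reward-modelling step at the heart of RLHF: a reward model is trained on pairwise human preference labels, and that learned reward is then used to fine-tune the policy (e.g. with PPO). This is an illustrative toy in plain Python/NumPy, not OpenAI's actual code; the function names and the synthetic data are my own assumptions.

```python
# Toy sketch of the reward-modelling step in RLHF (illustrative only, not
# OpenAI's implementation). A linear "reward model" scores responses and is
# trained on pairwise human preferences with the standard Bradley-Terry loss:
#   loss = -log(sigmoid(r(chosen) - r(rejected)))
import numpy as np

rng = np.random.default_rng(0)

def reward(w, features):
    """Scalar reward for a response; here just a linear score of its features."""
    return features @ w

def preference_loss_and_grad(w, chosen, rejected):
    """Bradley-Terry loss and gradient for one (chosen, rejected) pair."""
    margin = reward(w, chosen) - reward(w, rejected)
    p_chosen = 1.0 / (1.0 + np.exp(-margin))        # P(human prefers "chosen")
    loss = -np.log(p_chosen)
    grad = -(1.0 - p_chosen) * (chosen - rejected)  # d(loss)/dw
    return loss, grad

# Hypothetical data: feature vectors for pairs of responses, where the human
# labeller preferred the first response of each pair.
dim = 8
pairs = [(rng.normal(size=dim) + 0.5, rng.normal(size=dim)) for _ in range(200)]

w = np.zeros(dim)
lr = 0.1
for epoch in range(50):
    total = 0.0
    for chosen, rejected in pairs:
        loss, grad = preference_loss_and_grad(w, chosen, rejected)
        w -= lr * grad
        total += loss
print(f"mean preference loss after training: {total / len(pairs):.3f}")
# The learned reward would then drive policy fine-tuning (e.g. via PPO),
# which is the step that nudges the model toward human preferences.
```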
However, this is a success story, so long as we recognize its limitations:
Limitations of this instantiation of RLHF
The biggest limitation is that it does essentially nothing for the inner alignment problem: if a mesa-optimizer appeared, RLHF would give us no way to detect or remove it.
A second limitation is that this is early-stage work, so the alignment results are understandably limited.
And third, this is a subhuman-intelligence model, so it remains an open question whether these results will hold up under scaling.
But it's a first success, so let's cheer on OpenAI.
Irrelevant text from the link
There are some results about diversity concerns and about whom the model should be aligned to, but those don't matter much for this post. Here's the text: