The True Story of How GPT-2 Became Maximally Lewd
This video recounts an incident at OpenAI in which flipping a single minus sign led the RLHF process to train GPT-2 to output only sexually explicit continuations. The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences," under section 4.4: "Bugs can optimize for bad behavior."
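The mechanism is easy to see in miniature. The sketch below is not OpenAI's code; the candidate continuations, preference scores, and bare-bones REINFORCE loop are all illustrative assumptions. It only shows how negating the reward with one sign flip makes the same optimizer chase the continuations raters liked least.

```python
# Minimal, hypothetical sketch of a reward-sign bug in a policy-gradient loop.
# All names and numbers are made up for illustration.
import numpy as np

candidates = ["benign continuation", "neutral continuation", "explicit continuation"]
# Toy stand-in for a learned reward model: higher = preferred by human raters.
human_preference_score = np.array([1.0, 0.0, -1.0])

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

def train_policy(reward_sign: float, steps: int = 500, lr: float = 0.1) -> np.ndarray:
    """REINFORCE-style update of a categorical policy over the candidates."""
    logits = np.zeros(len(candidates))
    rng = np.random.default_rng(0)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(candidates), p=probs)
        # The bug: reward_sign = -1.0 silently negates every reward.
        reward = reward_sign * human_preference_score[action]
        # Policy-gradient step: grad of log pi(action) is one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        logits += lr * reward * grad_log_pi
    return softmax(logits)

print("correct sign:", dict(zip(candidates, train_policy(+1.0).round(2))))
print("flipped sign:", dict(zip(candidates, train_policy(-1.0).round(2))))
# With the flipped sign, the policy concentrates on the continuation raters liked least.
```

Running the toy loop with the correct sign concentrates probability on the preferred continuation; with the flipped sign, the identical optimizer concentrates on the lowest-rated one, which is the shape of the failure the paper describes.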
Humans get frustrated and bored, and have very limited attention. LLM cognition is almost too cheap to meter and can parallelize very effectively across both the code itself and the kinds of vulnerabilities it is looking for.