The True Story of How GPT-2 Became Maximally Lewd
This video recounts an incident at OpenAI in which flipping a single minus sign led the RLHF process to train GPT-2 to output only sexually explicit continuations. The incident is described in OpenAI's paper "Fine-Tuning Language Models from Human Preferences," under section 4.4: "Bugs can optimize for bad behavior."
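The mechanism is easy to see in miniature. The sketch below is not OpenAI's code; the candidate continuations, preference scores, and bare-bones REINFORCE loop are all illustrative assumptions. It only shows how negating the reward with one sign flip makes the same optimizer chase the continuations raters liked least.

```python
# Minimal, hypothetical sketch of a reward-sign bug in a policy-gradient loop.
# All names and numbers are made up for illustration.
import numpy as np

candidates = ["benign continuation", "neutral continuation", "explicit continuation"]
# Toy stand-in for a learned reward model: higher = preferred by human raters.
human_preference_score = np.array([1.0, 0.0, -1.0])

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

def train_policy(reward_sign: float, steps: int = 500, lr: float = 0.1) -> np.ndarray:
    """REINFORCE-style update of a categorical policy over the candidates."""
    logits = np.zeros(len(candidates))
    rng = np.random.default_rng(0)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(candidates), p=probs)
        # The bug: reward_sign = -1.0 silently negates every reward.
        reward = reward_sign * human_preference_score[action]
        # Policy-gradient step: grad of log pi(action) is one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        logits += lr * reward * grad_log_pi
    return softmax(logits)

print("correct sign:", dict(zip(candidates, train_policy(+1.0).round(2))))
print("flipped sign:", dict(zip(candidates, train_policy(-1.0).round(2))))
# With the flipped sign, the policy concentrates on the continuation raters liked least.
```

Running the toy loop with the correct sign concentrates probability on the preferred continuation; with the flipped sign, the identical optimizer concentrates on the lowest-rated one, which is the shape of the failure the paper describes.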
Humans get frustrated and bored, and have very limited attention. LLM cognition is almost too cheap to meter and can parallelize very effectively across both the code itself and the kinds of vulnerabilities it is looking for.