Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs, rather than from labeled data or a ground-truth reward signal.
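In practice, the human evaluations are often collected as pairwise comparisons between model outputs and used to fit a learned reward model, which then stands in for the missing ground-truth reward during reinforcement learning. The sketch below illustrates just that reward-modeling step with a Bradley-Terry preference loss; everything in it (the `RewardModel` class, the stand-in embeddings, the hyperparameters) is hypothetical and chosen only for illustration, not taken from any particular RLHF implementation.

```python
# Minimal sketch: fitting a reward model from pairwise human preferences.
# All names, shapes, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Toy scorer mapping a fixed-size (prompt, response) embedding to a scalar reward."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the human-preferred response to score higher,
    # i.e. minimize -log sigmoid(r_chosen - r_rejected) over the batch.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()


model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in embeddings for (prompt, preferred response) and (prompt, rejected response) pairs.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

In a full RLHF pipeline, the trained reward model would then score the policy's outputs while the policy is fine-tuned with a reinforcement learning algorithm (commonly PPO), so that human judgments, rather than a hand-specified reward function, shape the model's behavior.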