RLHF

Edited by Multicore et al.; last updated 2nd Oct 2024

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model's training signal comes from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal.
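In the common RLHF pipeline, humans compare pairs of model outputs for the same prompt, a reward model is fit to those preference judgments, and the base model is then fine-tuned with reinforcement learning (e.g. PPO) against the learned reward, usually with a penalty for drifting too far from the original model. The sketch below illustrates only the reward-model step under a Bradley-Terry style preference loss; the RewardModel class, tensor shapes, and random data are illustrative assumptions, not any particular implementation.

```python
# Minimal sketch of the reward-model step in RLHF (illustrative only).
# Humans pick the preferred of two responses; the reward model is trained
# so that its score for the preferred response exceeds the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected)
    # pushes the reward of the human-preferred response above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Tiny synthetic batch: random features stand in for encoded (prompt, response) pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
```

In practice the reward model is typically initialized from the language model itself and scores full token sequences, and the subsequent RL step optimizes the policy against this learned reward while penalizing divergence from the pre-RLHF model.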

Posts tagged RLHF
253 · Thoughts on the impact of RLHF research · paulfchristiano · 3y · 102 comments
29 · [Link] Why I’m excited about AI-assisted human feedback · janleike · 3y · 0 comments
120 · Compendium of problems with RLHF · Charbel-Raphaël · 3y · 16 comments
631 · The Waluigi Effect (mega-post) · Cleo Nardo · 3y · 188 comments
284 · Mysteries of mode collapse · janus · 3y · 57 comments
30 · Interpreting the Learning of Deceit · RogerDearnaley · 2y · 14 comments
108 · Trying to disambiguate different questions about whether RLHF is “good” · Buck · 3y · 47 comments
98 · [Link] Why I’m optimistic about OpenAI’s alignment approach · janleike · 3y · 15 comments
95 · RLHF does not appear to differentially cause mode-collapse · Arthur Conmy, beren · 2y · 9 comments
95 · On the functional self of LLMs · eggsyntax · 2mo · 35 comments
71 · MetaAI: less is less for alignment. · Cleo Nardo · 2y · 17 comments
71 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF · janus · 3y · 8 comments
70 · The True Story of How GPT-2 Became Maximally Lewd · Writer, Jai · 2y · 7 comments
68 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · LawrenceC · 3y · 11 comments
66 · Towards Understanding Sycophancy in Language Models · Ethan Perez, mrinank_sharma, Meg, Tomek Korbak · 2y · 0 comments
(Showing 15 of 77 posts tagged RLHF.)