Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique, and an alignment technique, in which the model's training signal comes from human evaluations of its outputs rather than from labeled data or a ground-truth reward signal. In the usual two-stage setup, a reward model is first fit to human preference judgments over pairs of outputs, and that learned reward model then stands in for a ground-truth reward when fine-tuning the policy with reinforcement learning.
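
Below is a minimal sketch of the reward-modelling stage, showing how pairwise human preferences can be turned into a scalar reward signal via a Bradley-Terry-style pairwise loss. It is illustrative only: the linear reward model and the random "feature vectors" standing in for model outputs are placeholder assumptions, not a real language-model pipeline or any particular library's API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: each model output is represented by a fixed-size feature vector
# (a stand-in for whatever representation a real reward model would use).
FEATURE_DIM = 16
reward_model = nn.Linear(FEATURE_DIM, 1)  # r(x): features -> scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Fake preference data: pairs where a human labeller marked one output as
# preferred over the other. Here the "preferred" rows are shifted so a
# separating reward function exists.
preferred = torch.randn(64, FEATURE_DIM) + 0.5  # assumed "better" outputs
rejected = torch.randn(64, FEATURE_DIM) - 0.5   # assumed "worse" outputs

for step in range(200):
    r_pref = reward_model(preferred).squeeze(-1)  # scalar reward per output
    r_rej = reward_model(rejected).squeeze(-1)
    # Bradley-Terry pairwise loss: -log sigmoid(r_preferred - r_rejected),
    # which pushes the reward of preferred outputs above rejected ones.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

In the second stage (not shown), the fitted reward model replaces the missing ground-truth reward, and the policy is optimized against it with a reinforcement learning algorithm such as PPO, typically with a penalty that keeps the policy close to the original model.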