RLHF
• Applied to Run evals on base models too! by orthonormal 1mo ago
• Applied to Why do we need RLHF? Imitation, Inverse RL, and the role of reward by Ran W 3mo ago
• Applied to The case for more ambitious language model evals by Jozdien 3mo ago
• Applied to The True Story of How GPT-2 Became Maximally Lewd by Writer 4mo ago
• Applied to Interpreting the Learning of Deceit by RogerDearnaley 5mo ago
• Applied to Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks. by Sohaib Imran 6mo ago
• Applied to Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour 6mo ago
• Applied to Paul Christiano on Dwarkesh Podcast by ESRogs 6mo ago
• Applied to Wireheading and misalignment by composition on NetHack by pierlucadoro 6mo ago
• Applied to Compositional preference models for aligning LMs by Tomek Korbak 6mo ago
• Applied to Towards Understanding Sycophancy in Language Models by Ethan Perez 6mo ago
• Applied to VLM-RM: Specifying Rewards with Natural Language by ChengCheng 7mo ago
• Applied to unRLHF - Efficiently undoing LLM safeguards by Pranav Gade 7mo ago
• Applied to LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B by Simon Lermen 7mo ago
• Applied to Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured by mnvr 7mo ago
• Applied to Beginner's question about RLHF by Ruby 9mo ago