Reinforcement learning
Applied to Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? by Matrice Jacobine 5d ago
Applied to Reward hacking is becoming more sophisticated and deliberate in frontier LLMs by Kei 5d ago
Applied to “The Era of Experience” has an unsolved technical alignment problem by Steven Byrnes 5d ago
Applied to Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning by Jeremias Ferrao 11d ago
Applied to The Theoretical Reward Learning Research Agenda: Introduction and Motivation by Joar Skalse 2mo ago
Applied to How to Contribute to Theoretical Reward Learning Research by Joar Skalse 2mo ago
Applied to Other Papers About the Theory of Reward Learning by Joar Skalse 2mo ago
Applied to Defining and Characterising Reward Hacking by Joar Skalse 2mo ago
Applied to Misspecification in Inverse Reinforcement Learning - Part II by Joar Skalse 2mo ago
Applied to Toward a Mathematical Definition of Rationality in Multi-Agent Systems by nekofugu 2mo ago
RobertM v1.1.0 (Feb 19th 2025 GMT), Lens: Arbital — changed tabTitle from LW Wiki to Arbital; changed deleted from false to true; changed title from LW Wiki to Arbital
Applied to Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM) by MiguelDev 3mo ago
Applied to Why Aligning an LLM is Hard, and How to Make it Easier by RogerDearnaley 3mo ago
Applied to Announcement: Learning Theory Online Course by Yegreg 3mo ago
Applied to Turning up the Heat on Deceptively-Misaligned AI by J Bostock 4mo ago
Dakara v1.20.0 (Dec 30th 2024 GMT) (+24/-70)
Applied to Human, All Too Human - Superintelligence requires learning things we can’t teach by Ben Turtel 4mo ago