Machine Learning (ML)
Applied to Reducing LLM deception at scale with self-other overlap fine-tuning by Gunnar_Zarncke 3d ago
Applied to Observations on self-supervised Learning for vision by Dinkar Juyal 7d ago
Applied to Estimating the Probability of Sampling a Trained Neural Network at Random by Adam Scherlis 16d ago
Applied to Self-fulfilling misalignment data might be poisoning our AI models by TurnTrout 16d ago
Applied to How to Contribute to Theoretical Reward Learning Research by Joar Skalse 16d ago
Applied to Other Papers About the Theory of Reward Learning by Joar Skalse 16d ago
Applied to Defining and Characterising Reward Hacking by Joar Skalse 16d ago
Applied to Misspecification in Inverse Reinforcement Learning by Joar Skalse 16d ago
Applied to The Theoretical Reward Learning Research Agenda: Introduction and Motivation by Joar Skalse 16d ago
Applied to Technical comparison of Deepseek, Novasky, S1, Helix, P0 by Juliezhanggg 19d ago
Applied to The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research by Arthur Conmy 21d ago
Applied to From No Mind to a Mind – A Conversation That Changed an AI by parthibanarjuna s 1mo ago
Applied to What If AI Recognized Meaning? An Inquiry into "Resonant Recognition" by JD___ 1mo ago
Applied to DeepSeek-R1 for Beginners by Anton Razzhigaev 1mo ago
Applied to Sleeper agents appear resilient to activation steering by Lucy Wingard 1mo ago
Applied to Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model by Saketh Baddam 1mo ago
Applied to How do biological or spiking neural networks learn? by Dom Polsinelli 1mo ago