MATS Program
• Applied to Reward hacking behavior can generalize across tasks by Ryan Kidd, 4d ago
• Applied to How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by Ryan Kidd, 7d ago
• Applied to Talent Needs of Technical AI Safety Teams by yams, 8d ago
• Applied to Infra-Bayesian haggling by Ryan Kidd, 10d ago
• Applied to Language Models Model Us by eggsyntax, 15d ago
• Applied to MATS Winter 2023-24 Retrospective by Rocket, 21d ago
• Applied to Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by Ryan Kidd, 1mo ago
• Applied to Mechanistically Eliciting Latent Behaviors in Language Models by TurnTrout, 1mo ago
• Applied to Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky, 1mo ago
• Applied to End-to-end hacking with language models by tchauvin, 2mo ago
• Applied to Ophiology (or, how the Mamba architecture works) by Danielle Ensign, 2mo ago
• Applied to My MATS Summer 2023 experience by James Chua, 2mo ago
• Applied to Understanding SAE Features with the Logit Lens by Joseph Bloom, 3mo ago
• Applied to MATS AI Safety Strategy Curriculum by Ryan Kidd, 3mo ago
• Applied to We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To by robertzk, 3mo ago
• Applied to Implementing activation steering by Annah, 4mo ago
• Applied to Attention SAEs Scale to GPT-2 Small by robertzk, 4mo ago
• Applied to How important is AI hacking as LLMs advance? by Artyom Karpov, 4mo ago