MATS Program
• Applied to Reward hacking behavior can generalize across tasks by Ryan Kidd, 4d ago
• Applied to How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by Ryan Kidd, 7d ago
• Applied to Talent Needs of Technical AI Safety Teams by yams, 8d ago
• Applied to Infra-Bayesian haggling by Ryan Kidd, 10d ago
• Applied to Language Models Model Us by eggsyntax, 15d ago
• Applied to MATS Winter 2023-24 Retrospective by Rocket, 21d ago
• Applied to Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by Ryan Kidd, 1mo ago
• Applied to Mechanistically Eliciting Latent Behaviors in Language Models by TurnTrout, 1mo ago
• Applied to Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky, 1mo ago
• Applied to End-to-end hacking with language models by tchauvin, 2mo ago
• Applied to Ophiology (or, how the Mamba architecture works) by Danielle Ensign, 2mo ago
• Applied to My MATS Summer 2023 experience by James Chua, 2mo ago
• Applied to Understanding SAE Features with the Logit Lens by Joseph Bloom, 3mo ago
• Applied to MATS AI Safety Strategy Curriculum by Ryan Kidd, 3mo ago
• Applied to We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To by robertzk, 3mo ago
• Applied to Implementing activation steering by Annah, 4mo ago
• Applied to Attention SAEs Scale to GPT-2 Small by robertzk, 4mo ago
• Applied to How important is AI hacking as LLMs advance? by Artyom Karpov, 4mo ago