LESSWRONG
MATS Program
• Applied to Debating with More Persuasive LLMs Leads to More Truthful Answers by Ryan Kidd 15d ago
• Applied to Automating LLM Auditing with Developmental Interpretability by DanielFilan 20d ago
• Applied to SAE Probing: What is it good for? Absolutely something! by Subhash Kantamneni 20d ago
• Applied to Bridging the VLM and mech interp communities for multimodal interpretability by Sonia Joseph 24d ago
• Applied to The slingshot helps with learning by Wilson Wu 26d ago
• Applied to Improving Model-Written Evals for AI Safety Benchmarking by Sunishchal Dev 1mo ago
• Applied to On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback by Marcus Williams 1mo ago
• Applied to Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution by Kola Ayonrinde 1mo ago
• Applied to [Job Ad] MATS is hiring! by Jana 1mo ago
• Applied to MATS AI Safety Strategy Curriculum v2 by Ryan Kidd 1mo ago
• Applied to Domain-specific SAEs by jacob_drori 1mo ago
• Applied to [Interim research report] Evaluating the Goal-Directedness of Language Models by Rauno Arike 2mo ago
• Applied to MATS Alumni Impact Analysis by Ryan Kidd 2mo ago
• Applied to The Geometry of Feelings and Nonsense in Large Language Models by Ryan Kidd 2mo ago
• Applied to Apply to MATS 7.0! by Ryan Kidd 2mo ago
• Applied to Calendar feature geometry in GPT-2 layer 8 residual stream SAEs by Ryan Kidd 2mo ago
• Applied to Showing SAE Latents Are Not Atomic Using Meta-SAEs by Ryan Kidd 2mo ago
• Applied to Experiments with an alternative method to promote sparsity in sparse autoencoders by Ryan Kidd 3mo ago
• Applied to Crafting Polysemantic Transformer Benchmarks with Known Circuits by Ryan Kidd 3mo ago
• Applied to Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs by Kola Ayonrinde 3mo ago