GDM Interp Progress Updates — LessWrong
73
[Summary] Progress Update #1 from the GDM Mech Interp Team
Ω
Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma
2y
Ω
0
80
[Full Post] Progress Update #1 from the GDM Mech Interp Team
Ω
Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma
2y
Ω
10
48
The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research
Ω
Arthur Conmy, Neel Nanda
1y
Ω
1
117
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
Ω
lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda
1y
Ω
15
139
A Pragmatic Vision for Interpretability
Ω
Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith
7mo
Ω
39
68
How Can Interpretability Researchers Help AGI Go Well?
Ω
Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith
7mo
Ω
1
87
Models May Behave Worse When Eval Aware
Ω
Senthooran Rajamanoharan, Neel Nanda
10d
Ω
7
61
Building and evaluating model diffing agents
Ω
bilalchughtai, Josh Engels, Neel Nanda
9d
Ω
2
69
SFT Drives Gemini’s Safety Properties
Ω
Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda
8d
Ω
3
50
Why Do Naive SFT Filters For Safety Properties Fail?
Ω
Josh Engels, Neel Nanda
7d
Ω
7
57
Synthetic document finetuning for instilling positive traits
Ω
CallumMcDougall, Arthur Conmy, Neel Nanda
5d
Ω
1