GDM Interp Progress Updates — LessWrong

Apr 19, 2024 by Neel Nanda

73 [Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

0

80 [Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

10

48 The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Arthur Conmy, Neel Nanda

1y

1

117 Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda

1y

15

139 A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

7mo

39

68 How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

7mo

1

87 Models May Behave Worse When Eval Aware

Senthooran Rajamanoharan, Neel Nanda

10d

7

61 Building and evaluating model diffing agents

bilalchughtai, Josh Engels, Neel Nanda

9d

2

69 SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda

8d

3

50 Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels, Neel Nanda

7d

7

57 Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy, Neel Nanda

5d

1