Miles Wang — LessWrong

Miles Wang's Shortform

Dec 15, 2023•1

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Overview: We tried to understand at a circuit level how Llama-2 models perform a simple task: not replying with a word when told not to. We mostly failed at fully reverse engineering the responsible circuit, but in the process we learned a few things about mechanistic interpretability. We share our...

Dec 15, 2023•34