Miles Wang's Shortform
Dec 15, 2023
Overview

We tried to understand at a circuit level how Llama-2 models perform a simple task: not replying with a word when told not to. We mostly failed at fully reverse engineering the responsible circuit, but in the process we learned a few things about mechanistic interpretability. We share our...