LESSWRONG
nps29

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Jul 19, 2024•55

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan!

We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction fidelity at a given sparsity level, without a hit to interpretability. We train through the discontinuity with straight-through estimators, which also let us directly optimise the L0.
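To make the idea concrete, here is a minimal pure-Python sketch of a JumpReLU activation with per-latent thresholds, the L0 count it induces, and one possible straight-through pseudo-derivative for the thresholds. The function names, the rectangle kernel, the bandwidth `eps`, and the example numbers are all assumptions chosen for illustration; none of them are specified in this post, so treat this as a toy rather than the paper's implementation.

```python
# Illustrative sketch only: names, kernel choice, bandwidth and values are assumptions.

def jumprelu(z, theta):
    """Pass each pre-activation through unchanged if it exceeds its threshold, else output 0."""
    return [zi if zi > ti else 0.0 for zi, ti in zip(z, theta)]

def l0(z, theta):
    """Count the active latents, i.e. the L0 of the JumpReLU output."""
    return sum(1 for zi, ti in zip(z, theta) if zi > ti)

def rectangle(x):
    """Rectangle kernel: True on (-1/2, 1/2), False elsewhere."""
    return -0.5 < x < 0.5

def l0_pseudo_grad_theta(z, theta, eps):
    """Straight-through pseudo-derivative of L0 with respect to each threshold.

    The true derivative of the step function is zero almost everywhere, so a
    window of width eps around the threshold stands in for it. Raising a
    threshold can only switch latents off, hence the negative sign.
    """
    return [-1.0 / eps if rectangle((zi - ti) / eps) else 0.0
            for zi, ti in zip(z, theta)]

z = [0.05, 0.30, -0.20, 0.90]      # hypothetical encoder pre-activations
theta = [0.10, 0.25, 0.10, 0.50]   # hypothetical learned thresholds

acts = jumprelu(z, theta)
active = l0(z, theta)
grads = l0_pseudo_grad_theta(z, theta, eps=0.2)
```

In this toy, only pre-activations that land within `eps / 2` of their threshold receive a non-zero pseudo-gradient, which is what would let a sparsity penalty on L0 move the thresholds despite the step function being flat almost everywhere.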

To accompany this, we will release the weights of hundreds of JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B in a few weeks. Apply now for early access to the 9B ones! We're...