Sometimes updating on evidence opens roads we do not want to take: roads we dislike because we know where they inevitably lead. We sometimes prefer to stay in homeostasis, in our current, suboptimal lane. One evocative example is the sort of paradoxical blend of invective mania and...
This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel.

Executive Summary

* Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most...
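To make "reconstructing the refusal direction" concrete, here is a minimal numpy sketch of the kind of check involved: pass a direction through a ReLU SAE and ask how many latents fire and how faithful the reconstruction is. All weights and the direction below are random stand-ins (assumptions), not a trained SAE or the actual refusal direction from Arditi et al.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256  # toy sizes; real SAEs are much wider

# Stand-ins for a trained SAE's weights (assumption: a plain ReLU SAE).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def sae_reconstruct(x):
    """Encode a vector into sparse latents, then decode it back."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU latents
    return f @ W_dec, f

# Stand-in for the refusal direction (in practice, derived from
# contrasting activations on harmful vs. harmless prompts).
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

recon, latents = sae_reconstruct(refusal_dir)
n_active = int((latents > 0).sum())  # sparsity: how many latents fire
cos = float(recon @ refusal_dir
            / (np.linalg.norm(recon) * np.linalg.norm(refusal_dir)))
```

A "sparse, interpretable reconstruction" would mean a small `n_active` with high cosine similarity, where the active latents have coherent interpretations; the reported failure is that no such decomposition was found.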
Intro

Anthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle-layer residual stream of the Gemma-2 2B base and IT models, along with code, implementation details / tips, and a replication of the core results...
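For readers new to crosscoders, here is a minimal numpy sketch of the forward pass and loss for a model-diffing crosscoder in the style of Lindsey et al.: a single shared latent space reads from and writes to both models, and the sparsity penalty weights each latent by the sum of its per-model decoder norms (so base-only, IT-only, and shared latents can be told apart by how that norm splits). Sizes and weights are toy stand-ins, not the released crosscoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 32, 128  # toy sizes

# Per-model encoders/decoders sharing one latent space (assumption:
# the shared-latent crosscoder variant described by Lindsey et al.).
W_enc = {m: rng.normal(size=(d_model, d_latent)) * 0.05 for m in ("base", "it")}
W_dec = {m: rng.normal(size=(d_latent, d_model)) * 0.05 for m in ("base", "it")}
b_enc = np.zeros(d_latent)

def crosscoder_loss(x_base, x_it, l1_coeff=1e-3):
    """One training-loss evaluation on a pair of residual-stream vectors."""
    # Shared latents sum encoder contributions from both models.
    f = np.maximum(x_base @ W_enc["base"] + x_it @ W_enc["it"] + b_enc, 0.0)
    recon_base = f @ W_dec["base"]
    recon_it = f @ W_dec["it"]
    mse = np.sum((recon_base - x_base) ** 2) + np.sum((recon_it - x_it) ** 2)
    # Each latent's penalty is its activation times the summed norms of
    # its two decoder rows; latents used by only one model pay less.
    dec_norms = (np.linalg.norm(W_dec["base"], axis=1)
                 + np.linalg.norm(W_dec["it"], axis=1))
    sparsity = l1_coeff * np.sum(f * dec_norms)
    return mse + sparsity, f
```

Model diffing then inspects the trained decoder norms: latents whose norm lives almost entirely in the IT decoder are candidate "chat-only" features.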
Executive Summary

* Refusing harmful requests is not a novel behavior learned in chat fine-tuning: pre-trained base models also refuse harmful requests (48% of harmful requests, 3% of harmless), just at a lower rate than chat models (90% harmful, 3% harmless).
* Further, for both Qwen1.5 0.5B...
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel.

Executive Summary

* We train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly...
This is the final post of our Alignment Forum sequence, produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

Executive Summary

* In a previous post we trained Attention Output Sparse Autoencoders (SAEs) on every layer of GPT-2 Small.
* Following that work, we...