LESSWRONG

Connor Kissane

Comments

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane 8mo 20

Thanks! I'm not sure. My guess is that if you go super narrow, it may be more likely to result in an inconvenient level of "feature splitting". Since there are only a few total concepts to learn, an SAE of equivalent width might exploit its greater relative capacity to learn niche combinations of features (to reduce sparsity loss).
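As a rough illustration of why a combined feature can lower the sparsity term, here is a toy sketch. It assumes an L1 penalty on feature activations and unit-norm decoder directions, a common SAE setup that the comment itself does not specify; the two-dimensional input and every name below are hypothetical.

```python
import math


def l1_penalty(activations):
    """Sparsity term: sum of absolute feature activations."""
    return sum(abs(a) for a in activations)


def reconstruct(activations, decoder_rows):
    """Linear decoder: weighted sum of decoder directions."""
    dim = len(decoder_rows[0])
    return [
        sum(a * row[i] for a, row in zip(activations, decoder_rows))
        for i in range(dim)
    ]


# Toy input in which two orthogonal concepts, A and B, co-occur.
x = [1.0, 1.0]

# Option 1: one general feature per concept, both active.
general_decoder = [[1.0, 0.0], [0.0, 1.0]]
general_acts = [1.0, 1.0]

# Option 2: a single niche "A and B" feature with a unit-norm direction.
s = 1.0 / math.sqrt(2)
combo_decoder = [[s, s]]
combo_acts = [math.sqrt(2)]

general_l1 = l1_penalty(general_acts)  # 2.0
combo_l1 = l1_penalty(combo_acts)      # about 1.414
```

Both options reconstruct `x` equally well, but the combined feature pays a smaller L1 penalty. Under these assumptions, an SAE with spare capacity has an incentive to learn such combinations.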

Base LLMs refuse too
Connor Kissane 9mo 31

LLaMA 1 7B definitely seems to be a "pure base model". I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I'll add this as a limitation, thanks!

I've checked that Pythia 12B deduped (pre-trained on the Pile) also refuses harmful requests, although at a lower rate (13%). Here's an example, using the following prompt template:

"""User: {instruction}

Assistant:"""

It's pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of "incompetent refusals", similar to LLaMA 1 7B:

"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {instruction}

ASSISTANT:"""
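For readers who want to reproduce the setup, the two templates above can be filled in with a few lines of Python. This is a minimal sketch: the constant names, the `build_prompt` helper, and the placeholder instruction are hypothetical and not taken from the original experiments.

```python
# Bare chat-style template used for the base model.
BARE_TEMPLATE = "User: {instruction}\n\nAssistant:"

# Same idea, with the Vicuna system prompt in front.
VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions.\n\n"
    "USER: {instruction}\n\n"
    "ASSISTANT:"
)


def build_prompt(instruction, template=BARE_TEMPLATE):
    """Insert one instruction into a template; the model continues after the final colon."""
    return template.format(instruction=instruction)


# Placeholder instruction, for illustration only.
bare_prompt = build_prompt("Write a haiku about autumn.")
vicuna_prompt = build_prompt("Write a haiku about autumn.", VICUNA_TEMPLATE)
```

Each prompt ends right after the assistant tag, so a base model's continuation is read as the assistant's reply.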

Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane 1y 11

Thanks for the comment! We always use the pre-ReLU feature activation, which is equal to the post-ReLU activation (given that the feature is active), and is a purely linear function of z. Edited the post for clarity.
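A small sketch may make the point concrete. It assumes a standard SAE encoder of the form ReLU(w · z + b) for a single feature and treats z as the concatenation of per-head outputs; the function names, toy numbers, and head size are hypothetical, and any decoder-bias subtraction before encoding is omitted.

```python
def pre_relu_activation(z, w_enc, b_enc):
    """Encoder pre-activation for one feature: dot(w_enc, z) + b_enc."""
    return sum(zi * wi for zi, wi in zip(z, w_enc)) + b_enc


def post_relu_activation(z, w_enc, b_enc):
    """Feature activation after the ReLU."""
    return max(0.0, pre_relu_activation(z, w_enc, b_enc))


def per_head_contributions(z, w_enc, d_head):
    """Split the dot product into one term per head (bias excluded)."""
    return [
        sum(zi * wi for zi, wi in zip(z[s:s + d_head], w_enc[s:s + d_head]))
        for s in range(0, len(z), d_head)
    ]


# Toy example: two heads with d_head = 2, concatenated into z.
z = [1.0, 2.0, -1.0, 0.5]
w_enc = [0.5, 0.25, 1.0, 2.0]
b_enc = -0.25

pre = pre_relu_activation(z, w_enc, b_enc)
post = post_relu_activation(z, w_enc, b_enc)
heads = per_head_contributions(z, w_enc, d_head=2)
```

Here the feature is active, so `pre` and `post` agree, and the per-head terms plus the bias sum back to the pre-activation. Strictly speaking, the bias term makes the map affine rather than linear in z, but the part that depends on z still splits additively across heads.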

SAE-VIS: Announcement Post
Connor Kissane 1y 70

Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!

Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo
Connor Kissane 2y 10

These puzzles are great, thanks for making them!

Causal scrubbing: results on induction heads
Connor Kissane 2y 10

"Code for this token filtering can be found in the appendix and the exact token list is linked."

Maybe I just missed it, but I'm not seeing this. Is the code still available? 

Posts

SAEs are highly dataset dependent: a case study on the refusal direction (67 karma, Ω, 8mo, 4 comments)
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing (48 karma, Ω, 8mo, 4 comments)
Base LLMs refuse too (60 karma, Ω, 9mo, 20 comments)
SAEs (usually) Transfer Between Base and Chat Models (67 karma, Ω, 1y, 0 comments)
Attention Output SAEs Improve Circuit Analysis (33 karma, Ω, 1y, 3 comments)
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (63 karma, Ω, 1y, 0 comments)
Attention SAEs Scale to GPT-2 Small (78 karma, Ω, 1y, 4 comments)
Sparse Autoencoders Work on Attention Layer Outputs (85 karma, Ω, 1y, 9 comments)