Hi Sonia - Can you please explain what you mean by "mixed selectivity"? In particular, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.

StefanHex's Shortform

RGRGRG5mo10

I like this recent post about atomic meta-SAE features. I think these are much closer than normal SAE features to what I expect atomic units to look like:

https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

Growth and Form in a Toy Model of Superposition

RGRGRG5mo10

Would you be willing to share the raw data from the plot "Developmental Stages of TMS"? I'm specifically hoping to look at line plots of weights vs. biases over time.

Thanks.

There Should Be More Alignment-Driven Startups

RGRGRG8mo10

What are the terms of the seed funding prize(s)?

Mechanistically Eliciting Latent Behaviors in Language Models

RGRGRG9mo10

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an orthogonality constraint between training runs?

Transcoders enable fine-grained interpretable circuit analysis for language models

RGRGRG9mo10

Question about the "rules of the game" you present: are you allowed to simply look at layer 0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from these features' top activators. From your case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but I wonder whether you can skip the search and simply look at the transcoder features directly. Thank you.

Finding Sparse Linear Connections between Features in LLMs

RGRGRG1yΩ110

To confirm - the weights you share, such as 0.26 and 0.23, are each individual entries in the W matrix for y = Wx?

Growth and Form in a Toy Model of Superposition

RGRGRG1y30

This is a casual thought and by no means something I've thought hard about: I'm curious whether b is a lagging indicator. That is, more of the magic may actually be going on in the weights, and once the weights go through this change, b catches up.

Another speculative thought: let's say we are moving from 4* -> 5* and W_3 is the new W whose magnitude |W_3| is becoming large. Does this occur because W_3 somehow has enough individual weights to jointly look at its two (new) neighbors' W_i's roughly equally?

Does the cosine similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition, and does this occur prior to the change in b?
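
For concreteness, here is a rough sketch of the check I have in mind. It assumes (my assumption, not something taken from the post) that W is stored as a NumPy array whose columns are the per-feature vectors W_i; the function name and the toy matrix are purely illustrative.

```python
import numpy as np

def neighbor_similarities(W, i, neighbors):
    """Dot products and cosine similarities between column W_i and each neighbor column.

    Assumes W has shape (hidden_dim, num_features), so W[:, j] is the vector W_j.
    """
    w = W[:, i]
    dots = np.array([w @ W[:, j] for j in neighbors])
    norms = np.array([np.linalg.norm(w) * np.linalg.norm(W[:, j]) for j in neighbors])
    # Guard against division by zero for (near-)zero vectors.
    cos = dots / np.maximum(norms, 1e-12)
    return dots, cos

# Toy (made-up) weights: 2 hidden dimensions, 3 features.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
dots, cos = neighbor_similarities(W, 2, [0, 1])
```

Running this on each saved checkpoint and plotting the results next to b would show whether the similarities move before b does.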

Growth and Form in a Toy Model of Superposition

RGRGRG1y30

Question about the gif - to me it looks like the phase transition is more like:

4++- to unstable 5+- to 4+- to 5-
(Unstable 5+- seems to have similar loss to 4+-).

Why do we not count the large red bar as a "-"?

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

RGRGRG1y40

Do you expect similar results (besides the fact that it would take longer to train / cost more) without using LoRA?
