Noa Nabeshima - LessWrong

One barrier to SAE circuits is that it's currently hard to understand how attention out SAE latents are calculated. Even if you do IG attribution patching to try to understand which earlier latents are relevant to the attention out SAE latents, it doesn't tell you how these latents interact inside the attention layer at all.

Noa Nabeshima's Shortform

Noa Nabeshima2mo20

Auto-interp is currently really really bad

I think o1 is the only model that seems to perform decently at auto-interp but it's very expensive! IE $1/latent label. This is frustrating to me.

Noa Nabeshima's Shortform

Noa Nabeshima2mo*30

TinyModel SAEs have these first entity and second entity latents.

E.g. if the story is 'Once upon a time Tim met his friend Sally.', Tim is the first entity and Sally is the second entity. The latents fire on all instances of first|second entity after the first introduction of that entity.

I think I at one point found an 'object owned by second entity' latent but have had trouble finding it again.

I wonder if LMs are generating these reusable 'pointers' and then doing computation with the pointers. For example to track that an object is owned by the first entity, you just need to calculate which entities are instances of the first entity, calculate when first entity is shown to own an object and write 'owned by first entity' to the object token, and then broadcast that forward to other instances of the object. Then, if you have the tokens Tim|'s

(and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.

This means that the exact identity information of the object (e.g. ' hammer') and the exact identity information of the first entity (' Tim') don't need to be passed around in computations, you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.

This suggests a more fine-grained story for what duplicate name heads are doing in IOI.

Matryoshka Sparse Autoencoders

Noa Nabeshima3moΩ220

You know I was thinking ab this-- say that there are two children and they're orthogonal to the parent and each have probability 0.4 given the parent. If you imagine the space it looks like three clusters, two with probability 0.4, norm 1.4 and one with probability 0.2 and norm 1. They all have high cosine similarity with each other. From this frame, having the parent 'include' the children directions a bit doesn't seem that inappropriate. One SAE latent setup that seems pretty reasonable is to actually have one parent latent that's like "one of these three clusters is active" and three child latents pointing to each of the three clusters. The parent latent decoder in that setup would also include a bit of the child feature directions.

This is all sketchy though. It doesn't feel like we have a good answer to the question "How exactly do we want the SAEs to behave in various scenarios?"

Matryoshka Sparse Autoencoders

Noa Nabeshima3moΩ330

This is cool! I wonder if it can be fixed. I imagine it could be improved some amount by nudging the prefix distribution, but it doesn't seem like that will solve it properly. Curious if this is a large issue in real LMs. It's frustrating that there aren't ground-truth features we have access to in language models.

I think how large of a problem this is can probably be inferred from a description of the feature distribution. It'd be nice to have a better sense of what that distribution is (assuming the paradigm is correct enough).

Matryoshka Sparse Autoencoders

Noa Nabeshima3mo*Ω332

Tree Methodology

To generate the trees in sparselatents.com/tree_view, I use a variant of Masked Cosine Similarity (MCS), a metric introduced in Towards Monosemanticity. The original MCS is calculated like this: For any two latents A and B, first compute the cosine similarity between their activations, but only considering tokens where latent A is active. Then compute the same similarity, but only for tokens where latent B is active. The final MCS value is the larger of these two similarities.

Instead of taking the max, I do a directed MCS where I just consider the cosine similarity between A and B's activations on tokens where B is active. Then, I multiply this directed MCS score by max(B activations)/max(A activations) to ignore latents that don't fire very much. I'm not sure that this multiplication step is necessary.

I also use a higher threshold of 0.6.

Starting from a parent latent, say S/1/12, I find all latents in a larger-width SAE (say S/2) that pass the directed MCS threshold. Then, I re-apply the method to those S/2 latents to find children in S/3.

The result is often a non-tree DAG as some of the identified latents in S/3 have more than one parent in S/2. To simplify rendering, I assign these latents to the parent they have the highest score with. This obscures the true structure, but I wasn't sure of a clean way to automatically render these DAGs.

The trees should be thought of not as fully displaying the structure of the model, but instead of surfacing small sets of latents that I expect demonstrate feature absorption when viewed together.

Matryoshka Sparse Autoencoders

Noa Nabeshima3moΩ110

The Matryoshka SAE trained on the toy model learn the true features on most runs, not all of them. Sometimes a small number of latents, modally one, seem to get stuck in a bad state.

Matryoshka Sparse Autoencoders

Noa Nabeshima3moΩ332

I thought I had some informal evidence that permuting the latents was good and after double checking some evidence I don't feel confident that it is good.

Training without permutation seems to attain slightly better FVU/L0, has reasonable looking features at a quick glance, seems to solve the toy model at comparable rates to permuted, and is simpler to code.

Learning Multi-Level Features with Matryoshka SAEs

Noa Nabeshima3mo*Ω230

This is great work! I like that you tested on large models and your very comprehensive benchmarking. I also like the BatchTopK architecture.

It's interesting to me that MSE has a smaller hit than cross-entropy.

Here are some notes I made:

We suspect that using a fixed group size leads to more stable training and faster convergence.

This seems plausible to me!

Should the smallest sub-SAE get gradients from all losses, or should the losses from larger sub-SAEs be stopped?

When I tried stopping the gradient from flowing from large sub-SAE losses to small it made later latents much less interpretable. I tried an approach where early latents got less gradient from larger sub-SAE losses and it seemed to also have less interpretable late latents. I don't know what's going on with this.

What is the effect of latent sorting that Noa uses on the benchmarks?

I tried not ordering the latents and it did comparably on FVU/L0. I vaguely recall that for mean-max correlation, permuting did worse on early latents and better on the medium latents. At a quick glance I weakly preferred the permuted SAE latents but it was very preliminary and I'm not confident in this.

I'd love to chat more with the authors, I think it'd be fun to explore our beliefs and process over the course of making the papers and compare notes and ideas.

Matryoshka Sparse Autoencoders

Noa Nabeshima3mo*Ω350

Even with all possible prefixes included in every batch the toy model learns the same small mixing between parent and children (this was best out of 2, for the first run the matryoshka didn't represent one of the features): https://sparselatents.com/matryoshka_toy_all_prefixes.png

Here's a hypothesis that could explain most of this mixing. If the hypothesis is true, then even if every possible prefix is included in every batch, there will still be mixing.

Hypothesis:

Regardless of the number of prefixes, there will be some prefix loss terms where
1. a parent and child feature are active
2. the parent latent is included in the prefix
3. the child latent isn't included in the prefix.

The MSE loss in these prefix loss terms is pretty large because the child feature isn't represented at all. This nudges the parent to slightly represent all of its children a bit.

To compensate for this, if a child feature is active and the child latent is included the prefix, it undoes the parent decoder vector's contribution to the features of the parent's other children.

This could explain these weird properties of the heatmap:
- Parent decoder vector has small positive cosine similarity with child features
- Child decoder vectors have small negative cosine similarity with other child features

Still unexplained by this hypothesis:
- Child decoder vectors have very small negative cosine similarity with the parent feature.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Tree Methodology