Noa Nabeshima - LessWrong

Even with all possible prefixes included in every batch, the toy model learns the same small mixing between parent and children. (This was the best of two runs; in the first run the Matryoshka SAE didn't represent one of the features.) https://sparselatents.com/matryoshka_toy_all_prefixes.png

Here's a hypothesis that could explain most of this mixing. If the hypothesis is true, then even if every possible prefix is included in every batch, there will still be mixing.

Hypothesis:

Regardless of the number of prefixes, there will be some prefix loss terms where
1. a parent and child feature are active
2. the parent latent is included in the prefix
3. the child latent isn't included in the prefix.

The MSE loss in these prefix loss terms is fairly large because the child feature isn't represented at all. This nudges the parent decoder vector to slightly represent all of the parent's children.

To compensate for this, if a child feature is active and the child latent is included in the prefix, the child latent undoes the parent decoder vector's contribution to the features of the parent's other children.

This could explain these weird properties of the heatmap:
- Parent decoder vector has small positive cosine similarity with child features
- Child decoder vectors have small negative cosine similarity with other child features

Still unexplained by this hypothesis:
- Child decoder vectors have very small negative cosine similarity with the parent feature.

Matryoshka Sparse Autoencoders

Noa Nabeshima 3d Ω110

That's very cool, I'm looking forward to seeing those results! The Top-K extension is particularly interesting, as that was something I wasn't sure how to approach.

I imagine you've explored important directions I haven't touched like better benchmarking, top-k implementation, and testing on larger models. Having multiple independent validations of an approach also seems valuable.

I'd be interested in continuing this line of research, especially circuits with Matryoshka SAEs. I'd love to hear about what directions you're thinking of. Would you want to have a call sometime about collaboration or coordination? (I'll DM you!)

Really looking forward to reading your post!

Matryoshka Sparse Autoencoders

Noa Nabeshima 5d Ω330

Yes, follow up work with bigger LMs seems good!

I use 10 prefix losses per batch here. I tried 100 prefixes per batch and the learned latents looked similar at a quick glance, so I wonder if naively training with block size = 1 might not be qualitatively different. I'm not that sure, and training faster with kernels seems good on its own too!

Maybe if you had a kernel for training with block size = 1 it would create surface area for figuring out how to work on absorption when latents are right next to each other in the matryoshka latent ordering.

Book a Time to Chat about Interp Research

Noa Nabeshima 15d 50

I really enjoy chatting with Logan about interpretability research.

Parasites (not a metaphor)

Noa Nabeshima 4mo 20

It'd be funny if stomach-related Gendlin Focusing content was at least partially related to worms or other random physical stuff.

Parasites (not a metaphor)

Noa Nabeshima 4mo 10

I tried this a week ago, and I'm now constipated for what I think might be the first time in years, though I'm not sure. I also think I might have less unpleasant stomach sensation overall.

Are the majority of your ancestors farmers or non-farmers?

Noa Nabeshima 5mo 10

23&me says I have more Neanderthal DNA than 96% of users, and my DNA attribution is half Japanese and half European. Your Neanderthal link doesn't work for me.

Efficient Dictionary Learning with Switch Sparse Autoencoders

Noa Nabeshima 5mo 103

Sometimes FLOP/s isn't the bottleneck for training models; e.g. it could be memory bandwidth. My impression from poking around with Nsight and some other observations is that wide SAEs might actually be FLOP/s bottlenecked, but I don't trust my impression that much. I'd be interested in someone comparing these SAE architectures in terms of H100 seconds or something like that, in addition to FLOPs.

Did it seem to you like this architecture also trained faster in terms of wall-time?

Anyway, nice work! It's cool to see these results.
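
To make the FLOP/s-versus-memory-bandwidth question above concrete, here is a rough roofline-style sketch for a single dense matmul. It is a back-of-the-envelope estimate, not a profile: it assumes each operand is read or written exactly once, and the hardware peaks and matrix shapes below are round illustrative placeholders, not measured H100 numbers or the shapes of any particular SAE. The function names are hypothetical.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, assuming each of the
    two inputs and the output crosses the memory bus exactly once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def bottleneck(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Roofline rule of thumb: compare intensity to the hardware ridge point."""
    ridge = peak_flops_per_s / peak_bytes_per_s
    return "compute" if intensity >= ridge else "memory bandwidth"


# Placeholder hardware peaks (assumptions for illustration only).
PEAK_FLOPS = 1e15  # FLOP/s
PEAK_BW = 3e12     # bytes/s

# Hypothetical encoder matmul: batch of activations times a wide dictionary.
large_batch = matmul_arithmetic_intensity(m=4096, k=768, n=65536)
single_row = matmul_arithmetic_intensity(m=1, k=768, n=65536)

print(bottleneck(large_batch, PEAK_FLOPS, PEAK_BW))
print(bottleneck(single_row, PEAK_FLOPS, PEAK_BW))
```

Under these assumptions a large-batch dense matmul sits above the ridge point and a single-row one sits far below it. Sparse or routed architectures can move real kernels away from this idealized picture in either direction, which is why a wall-clock comparison would be informative alongside FLOP counts.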

Sparse Autoencoders Work on Attention Layer Outputs

Noa Nabeshima 10mo Ω110

I wonder if multiple heads having the same activation pattern in a context is related to the limited rank per head; once the VO subspace of each head is saturated with meaningful directions/features maybe the model uses multiple heads to write out features that can't be placed in the subspace of any one head.
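
For reference, the rank limit mentioned above can be written out as follows. The symbols are the usual transformer notation rather than anything defined in this thread, and a row-vector convention is assumed: head h reads with a value matrix W_V^h and writes with an output matrix W_O^h.

```latex
W_{OV}^{h} = W_V^{h} W_O^{h},
\qquad
W_V^{h} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}},
\quad
W_O^{h} \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}
\;\;\Longrightarrow\;\;
\operatorname{rank}\!\left(W_{OV}^{h}\right) \le d_{\text{head}}
```

So a single head can write into at most a d_head-dimensional subspace of the residual stream, which is the constraint the speculation about multiple heads sharing an activation pattern relies on.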

Why indoor lighting is hard to get right and how to fix it

Noa Nabeshima 1y 10

Do you have any updates on this? I'm interested in this.
