Interesting idea, I had not considered this approach before!

I'm not sure this would solve feature absorption though. Thinking about the "Starts with E-" and "Elephant" example: if the "Elephant" latent absorbs the "Starts with E-" latent, the "Starts with E-" feature will develop a hole and not activate anymore on the input "elephant". After the latent is absorbed, "Starts with E-" wouldn't be in the list to calculate cumulative losses for that input anymore.

Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
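To make that pressure concrete, here is a minimal pure-Python sketch of a nested reconstruction loss. It is not the training code from the post: the toy decoder directions, the prefix sizes, and the function names (`reconstruct`, `mse`, `matryoshka_loss`) are all assumptions chosen for illustration.

```python
def reconstruct(acts, decoder, bias, n_latents):
    """Decode using only the first n_latents latents (a nested 'prefix')."""
    out = list(bias)
    for i in range(n_latents):
        if acts[i] != 0.0:
            for d in range(len(out)):
                out[d] += acts[i] * decoder[i][d]
    return out


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def matryoshka_loss(x, acts, decoder, bias, prefix_sizes):
    """Sum the reconstruction losses of every nested prefix of latents."""
    return sum(mse(x, reconstruct(acts, decoder, bias, m)) for m in prefix_sizes)


# Hypothetical toy setup: dimension 0 is "starts with E", dimension 1 is
# "elephant-specific". The input "elephant" carries both components.
x = [1.0, 1.0]
bias = [0.0, 0.0]
prefix_sizes = [1, 2]  # latent 0 alone, then latents 0 and 1 together

# No absorption: latent 0 fires on "elephant", latent 1 adds the rest.
loss_clean = matryoshka_loss(x, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], bias, prefix_sizes)

# Absorption: latent 1 has swallowed the "starts with E" direction, so
# latent 0 stays silent and the size-1 prefix reconstructs nothing.
loss_absorbed = matryoshka_loss(x, [0.0, 1.0], [[1.0, 0.0], [1.0, 1.0]], bias, prefix_sizes)

print(loss_clean, loss_absorbed)
```

In this toy case both solutions reconstruct the input perfectly when all latents are used, but the absorbed solution pays a larger loss on the size-1 prefix, which is the kind of penalty that discourages a later latent from taking over an earlier one's job.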

Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann 4mo 11

Although the code has the option to add an L1 penalty, in practice I set l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).

Hire (or Become) a Thinking Assistant

Bart Bussmann 4mo 40

I haven't actually tried this, but I recently heard about focusbuddy.ai, which might be a useful AI assistant in this space.

Matryoshka Sparse Autoencoders

Bart Bussmann 4mo Ω 20352

Great work! I have been working on something very similar and will publish my results here some time next week, but I can already give a sneak peek:

The SAEs here were only trained for 100M tokens (1/3 the TinyStories dataset). The language model was trained for 3 epochs on the 300M token TinyStories dataset. It would be good to validate these results with more 'real' language models and train SAEs with much more data.

I can confirm that on Gemma-2-2B, Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al., as implemented in SAEBench!

Is there a nice way to extend the Matryoshka method to top-k SAEs?

Yes! My experiments with Matryoshka SAEs are using BatchTopK.
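For readers who have not seen BatchTopK, the rough idea is to keep the largest k × batch_size activations across the whole batch instead of the top k per sample. The sketch below is a pure-Python illustration under that assumption; the function name `batch_topk`, the restriction to positive values, and the toy numbers are hypothetical and not taken from the experiments mentioned here.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive activations over the batch.

    pre_acts is a list of per-sample activation lists. Everything outside
    the batch-wide budget is zeroed, so individual samples may end up with
    more or fewer than k active latents.
    """
    batch_size = len(pre_acts)
    budget = k * batch_size
    # Flatten to (value, sample index, latent index), positives only.
    flat = [
        (value, b, j)
        for b, row in enumerate(pre_acts)
        for j, value in enumerate(row)
        if value > 0
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    out = [[0.0] * len(row) for row in pre_acts]
    for value, b, j in flat[:budget]:
        out[b][j] = value
    return out


# Toy batch of two samples with three latents each, and k = 1.
pre_acts = [
    [0.9, 0.8, 0.7],
    [0.1, 0.0, 0.2],
]
sparse = batch_topk(pre_acts, k=1)
print(sparse)
```

With k = 1 the batch-wide budget is two activations, and both go to the first sample. A per-sample top-k would instead keep exactly one latent in each row, so the average sparsity is the same but the allocation differs.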

Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).

Visible Thoughts Project and Bounty Announcement

Bart Bussmann 5mo 20

Three years later, we actually have LLMs with visible thoughts, such as DeepSeek, QwQ, and (although partially hidden from the user) o1-preview.

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs.

Good call!

Daniel Kokotajlo's Shortform

Bart Bussmann 5mo 83

Sing along! https://suno.com/song/35d62e76-eac7-4733-864d-d62104f4bfd0

Could we use current AI methods to understand dolphins?

Bart Bussmann 6mo 50

This project seems to be trying to translate whale language.

Canaletto's Shortform

Bart Bussmann 7mo 60

You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Bart Bussmann 7mo 60

Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.

There definitely seem to be multiple ways to interpret this work, as also described in "SAE feature geometry is outside the superposition hypothesis". Either we need to find other methods and theory that somehow find more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.

Both seem important and interesting lines of work to me!

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Bart Bussmann 7mo 80

Great work! Spelling is a very clear example of how information gets absorbed into SAE latents, and indeed in Meta-SAEs we found many spelling- and sound-related meta-latents.

I have been thinking a bit about how to solve this problem. One experiment I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.

Potentially, this would remove the "Starts-with-L" component from the "lion" token direction and activate the "Starts-with-L" latent instead, although this would come at the cost of worse sparsity/reconstruction.
