Taras Kutsyk — LessWrong

Replying to What is the functional role of SAE errors?

Thank you Clément, your hypothesis about the linear component is quite intriguing! I read Josh's Dark Matter paper a while ago, and I remember there were multiple versions floating around at the time, so I'd definitely like to revisit the latest one before responding in depth.

That said, I can comment based on your explanation already. The main motivation I had behind the restoration experiment was to test the idea that, without the error nodes, the key features in the circuit simply can't be computed. So when you say that the restoration effect may come from the non-linear component, do you mean that "the key features can't be computed" might be caused simply...

Replying to What is the functional role of SAE errors?

Taras Kutsyk, 8mo

corrected, thanks for pointing it out!

What is the functional role of SAE errors?

Taras Kutsyk, Tim Hua, woog, Andre Assis

8mo

TL;DR:

We explored the role of Sparse Autoencoder (SAE) errors in two contexts for Gemma-2 2B with Gemma Scope SAEs: sparse feature circuits (subject-verb agreement across a relative clause) and linear probing.
Circuit investigation: While ablating residual error nodes in our circuit completely destroys the model's performance, we found that this effect can be fully mitigated by restoring a narrow group of late-mid SAE features.
One hypothesis that could explain this (and the other ablation-based experiments we performed) is that SAE errors might contain intermediate feature representations from cross-layer superposition.
To investigate this beyond ablation-restoration experiments, we tried to apply crosscoder analysis but got stuck at the point of training an acausal crosscoder; instead we propose a specific MVE

...
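To make the term "error node" in the summary above concrete, here is a minimal sketch of how an activation splits into an SAE reconstruction plus an error term, and what ablating that term does. Everything in it is an assumption for illustration: the weights are random and untrained, the encoder uses a plain ReLU, the ablation is zero-ablation, and the names (`sae_decompose`, `W_enc`, `W_dec`) are hypothetical rather than taken from the post or from any SAE library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Hypothetical toy SAE weights (random, untrained). A real experiment would
# load trained SAE weights instead of sampling them.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)


def sae_decompose(x):
    """Split activations x into feature activations, reconstruction, and error."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # assumed ReLU encoder
    x_hat = f @ W_dec + b_dec               # SAE reconstruction
    err = x - x_hat                         # the "error node": what the SAE misses
    return f, x_hat, err


x = rng.normal(size=(4, d_model))  # stand-in for residual-stream activations
f, x_hat, err = sae_decompose(x)

# Keeping the error node makes the decomposition exact.
x_with_error = x_hat + err

# Zero-ablating the error node (an assumed ablation type) leaves only
# the SAE reconstruction to be passed downstream.
x_error_ablated = x_hat + 0.0 * err
```

In a circuit experiment, `x_error_ablated` would replace the original activation in the forward pass. Restoring selected downstream features would then mean patching their clean activations back in while the error stays ablated.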

Replying to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, 1y

Thanks for the insight! I'd expect the same to hold for Gemma 2B base (pre-trained) vs. Gemma 2B Instruct, though? Gemma-2b-Python-codes is just a full finetune on top of the Instruct model (probably produced without a large number of update steps), and previous work on Instruct models indicated that SAEs don't transfer to Gemma 2B Instruct either.

Replying to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, 1y

Thanks! We'll take a closer look at these when we decide to extend our results to more models.

Replying to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, 1y*

Let me make sure I understand your idea correctly:

1. We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations.
2. We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn't finetuned on the finetuned model activations?).
3. We then use this model to determine "what direction most closely maps to the activation pattern across input sequences, and how well it maps".

I'm most unsure about the 2nd step - how we train this feature-activation model. If we train it on the base SAE activations in the finetuned model, I'm afraid we'll just train it on extremely noisy data, because feature activations essentially do not mean the same thing, unless your SAE has been finetuned to appropriately reconstruct the finetuned model activations. (And if we finetune it, we might just as well use the SAE and feature-universality techniques I outlined without needing a separate model).
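One possible reading of the three steps above, sketched for a single feature. This is an illustration under loud assumptions, not the method either commenter has in mind: the data is synthetic, the "single-layer model" is reduced to a least-squares linear map (no ReLU), and `fit_feature_direction` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 512, 16

# Synthetic stand-ins: finetuned-model activations and the activations of
# one SAE feature on the same tokens (both assumed, not real model data).
acts_finetuned = rng.normal(size=(n_tokens, d_model))
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
feature_acts = acts_finetuned @ true_direction + 0.1 * rng.normal(size=n_tokens)


def fit_feature_direction(acts, target):
    """Least-squares linear map from activations to one feature's activations.

    Returns the unit-norm direction, the bias, and R^2 as a rough
    measure of "how well it maps".
    """
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    w, b = coef[:-1], coef[-1]
    pred = X @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return w / np.linalg.norm(w), b, r2


direction, bias, r2 = fit_feature_direction(acts_finetuned, feature_acts)
cosine_to_truth = float(direction @ true_direction)
```

The noise concern raised above shows up directly in this setup. If the base SAE's feature activations no longer track a consistent direction in the finetuned model, `feature_acts` is closer to noise and `r2` drops, whatever direction the fit returns.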

Replying to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, 1y*

I like the idea of seeing if there are any features from the base model which are dead in the instruction-tuned-and-fine-tuned model as a proxy for "are there any features which fine-tuning causes the fine-tuned model to become unable to recognize"

Agreed, but I think our current setup is too limited to capture this. If we’re using the same “base SAE” for both the base and finetuned models, a situation like the one you described really implies “now this feature from the base model has a different vector (direction) in the activation space OR this feature is no longer recognizable”. Without training another SAE on the finetuned model, we have no way to...

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, Tommaso Mencattini, Ciprian Florea

This is a project submission post for the AI Safety Fundamentals course from BlueDot Impact. Therefore, some of its sections are intended to be beginner-friendly and overly verbose for familiar readers (mainly the Introduction section) and may freely be skipped.

TLDR (Executive Summary)

We explored whether Sparse Autoencoders (SAEs) can effectively transfer from base language models to their finetuned counterparts, focusing on two base models: Gemma-2b and Mistral-7B-V0.1 (we tested finetuned versions for coding and mathematics, respectively).
In particular, we split our analysis into three steps:
1. We analysed the similarity (cosine similarity and Euclidean distance) of the residual activations, which was highly correlated with the resulting transferability of the SAEs for the two models.
2. We computed several performance

...
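As a rough illustration of the first step in the summary above, the sketch below computes mean per-token cosine similarity and Euclidean distance between two sets of residual activations. The arrays are synthetic stand-ins, and `activation_similarity` is a hypothetical helper, not code from the post. Averaging over tokens is also an assumption about how the metrics are aggregated.

```python
import numpy as np


def activation_similarity(base_acts, finetuned_acts):
    """Mean per-token cosine similarity and Euclidean distance.

    Both inputs have shape (n_tokens, d_model) and are assumed to come
    from the same tokens at the same layer of the two models.
    """
    dot = np.sum(base_acts * finetuned_acts, axis=-1)
    norms = np.linalg.norm(base_acts, axis=-1) * np.linalg.norm(finetuned_acts, axis=-1)
    cosine = dot / np.clip(norms, 1e-12, None)  # guard against zero vectors
    euclid = np.linalg.norm(base_acts - finetuned_acts, axis=-1)
    return float(cosine.mean()), float(euclid.mean())


rng = np.random.default_rng(0)
base = rng.normal(size=(128, 64))                       # stand-in base activations
finetuned = base + 0.05 * rng.normal(size=base.shape)   # small synthetic drift
mean_cos, mean_dist = activation_similarity(base, finetuned)
```

High cosine similarity with a small distance would suggest that the finetuned residual stream stays close to the one the SAE was trained on.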