Tommaso Mencattini

Master's student at EPFL (Data Science)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Sep 29, 2024•28

24 karma

Member for a year

Taras Kutsyk, Tommaso Mencattini, Ciprian Florea

This is a project submission post for the AI Safety Fundamentals course from BlueDot Impact. Some of its sections (mainly the Introduction) are therefore intended to be beginner-friendly; readers already familiar with the topic may find them verbose and can freely skip them.

TLDR (Executive Summary)

We explored whether Sparse Autoencoders (SAEs) can effectively transfer from base language models to their finetuned counterparts, focusing on two base models: Gemma-2b and Mistral-7B-V0.1 (we tested finetuned versions for coding and mathematics, respectively).
In particular, we split our analysis into three steps:
1. We analysed the similarity (cosine similarity and Euclidean distance) of the residual activations, which was highly correlated with the resulting transferability of the SAEs for the two models.
2. We computed several performance ...
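The two similarity measures named in step 1 can be illustrated with a minimal sketch. It assumes that activations from the base and finetuned models are paired by token position and that the metrics are averaged over positions; the excerpt does not say how the authors aggregated them, so the function names, the averaging step, and the toy vectors below are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 distance between two activation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_similarity(base_acts, finetuned_acts):
    """Average both metrics over paired token positions (an assumed aggregation)."""
    pairs = list(zip(base_acts, finetuned_acts))
    mean_cos = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    mean_dist = sum(euclidean_distance(a, b) for a, b in pairs) / len(pairs)
    return mean_cos, mean_dist

# Toy stand-ins for residual activations at two token positions (hypothetical values).
base_acts = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
finetuned_acts = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

mean_cos, mean_dist = mean_similarity(base_acts, finetuned_acts)
print(f"mean cosine similarity: {mean_cos:.3f}")
print(f"mean Euclidean distance: {mean_dist:.3f}")
```

In practice the vectors would be residual-stream activations collected from both models on the same inputs; a higher mean cosine similarity and a lower mean distance would indicate that finetuning moved the activations less.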
