CLIP-Scope?

Inspired by Gemma-Scope We trained 8 Sparse Autoencoders each on 1.2 billion tokens on different layers of a Vision Transformer. These (along with more technical details) can be accessed on huggingface. We also released a pypi package for ease of use.

We hope that this will allow researchers to experiment with vision transformer interpretability without having to go through the drudgery of training sparse auto encoders from scratch.

the top 9 activating images in LAION for feature 392 of the SA trained on layer 20

Vital Statistics

Model: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
Number of tokens trained per autoencoder: 1.2 Billion
Token type: all 257 image tokens (as opposed to just the CLS token)
Number of unique images trained per autoencoder: 4.5 Million
Training

... (read 221 more words →)

Replying toFood, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos1y

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

It seems like it's possible right?

Interestingly a similar tool run by a very smart guy (https://www.creatorml.com/) is apparently about to shut down, so it might not be a financially sustainable thing to try.

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos

TL;DR

A Sparse Autoencoder trained on CLIP activations detects features in youtube thumbnails associated with high view counts across a dataset of 524,288 thumbnails from youtube-8m. Most of these features are interpretable and monosemantic. The most strongly associated feature has a median view count > 6.5x the dataset median. After simulating random features and comparing their median views we conclude that the probability that these observations could have occurred at random is zero or near zero.

These features may correspond to visual elements which attract the attention of youtube users, but they may instead correlate to some other, less exciting structure related with higher views such as date of upload.

Similar analysis is conducted on... (read 2041 more words →)

Replying toTowards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

Louka Ewington-Pitsos1y

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

I couldn't find a link to the code in the article so in case anyone else wants to try to replicate I think this is it: https://github.com/HugoFry/mats_sae_training_for_ViTs

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Louka Ewington-Pitsos

This article is a footnote to the (excellent) Tokenized SAEs. To briefly summarize: the authors (Thomas Dooms and Daniel Wilhelm) start by proposing the idea of a token "unigram".

Let Ut be the unigram for a specific token t. Conceptually there is a unique Ut for each layer Mi in each model M, such that Ut(M, i) are the activations coming out of Mi when M is fed just t by itself. More abstractly Ut are the activations which correspond to just t taken in isolation, un-bound by any particular context.

The authors then demonstrated that the activation at position j in any token sequence will have a high cosine similarity to Utj(M, i),... (read 755 more words →)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Louka Ewington-Pitsos

Summary

300 million GPT2-small activations are cached on s3, we pull these very quickly onto a g4dn.8xlarge ec2 instance in the same region and use them to train a 24,576 dimensional Switch Sparse Autocoder in 26 minutes (excluding generation time). We achieve similar L0/reconstruction loss to Gao et al. and a low dead feature proportion. Typically this process takes in the realm of 3-4 hours and far more GPU power, for instance here and here. Code to replicate is made public and so are the weights and biases logs.

Background

There are many excellent resources explaining Sparse Autoencoders and how to train them. The ones linked below do a far better job of... (read 1249 more words →)

Replying toEfficient Dictionary Learning with Switch Sparse Autoencoders

Louka Ewington-Pitsos2y

Efficient Dictionary Learning with Switch Sparse Autoencoders

Just to close the loop on this one, the official huggingface transformers library just uses a for-loop to achieve MoE. I also implemented a version myself using a for loop and it's much more efficient than either vanilla matrix multiplication or that weird batch matmul I write up there for large latent and batch sizes.

Replying toEfficient Dictionary Learning with Switch Sparse Autoencoders

Louka Ewington-Pitsos2y

Efficient Dictionary Learning with Switch Sparse Autoencoders

wait a minute... could you just...

you don't just literally do this do you?

input = torch.tensor([
    [1, 2],
    [1, 2],
    [1, 2],
]) # (bs, input_dim)


enc_expert_1 = torch.tensor([
    [1, 1, 1, 1],
    [1, 1, 1, 1],

])
enc_expert_2 = torch.tensor([
    [3, 3, 0, 0],
    [0, 0, 2, 0],
])



dec_expert_1 = torch.tensor([
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
    [ -1, -1],
])

dec_expert_2 = torch.tensor([
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],
    [-10, -10,],

])

def moe(input, enc, dec, nonlinearity):
    input = input.unsqueeze(1)
    latent = torch.bmm(input, enc)

    recon = torch.bmm(nonlinearity(latent, dec))

    return recon.squeeze(1), latent.squeeze(1)


# not this! some kind of actual routing algorithm, but you end up with something similar
enc = torch.stack([enc_expert_1, enc_expert_2, enc_expert_1])
dec = torch.stack([dec_expert_1, dec_expert_2, dec_expert_1])

nonlinearity = torch.nn.ReLU()
recons, latent = moe(input, enc, dec, nonlinearity)

This must in some way be horrifically inefficient, right?

Replying toEfficient Dictionary Learning with Switch Sparse Autoencoders

Louka Ewington-Pitsos2y

Efficient Dictionary Learning with Switch Sparse Autoencoders

Can I ask what you used to implement the MOE routing? Did you use megablocks? I would love to expand on this research but I can't find any straightforward implementation of efficient pytorch MOE routing online.

Do you simply iterate over each max probability expert every time you feed in a batch?

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos

Summary

Sparse Autoencoders (SAEs) have become a popular tool in Mechanistic Interpretability but at the time of writing there is no standardm automated way to accurately assess the quality of a SAE. In practice many different metrics are applied and interpreted to assess SAE quality in different studies.

Conceptually however what makes a good SAE is relatively uncontroversial: its latent features must reflect the structure of the model it was trained on whilst also reflecting the structure of human linguistic understanding^[1]^[2]^[3]. These twin objectives are usually referred to as "Faithfulness" and "Interpretability"^[1]^[4].

In this article we will go through 8 common metrics used to evaluate the quality of a SAE and try to demonstrate that, though... (read 3362 more words →)

Replying toResearch Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Louka Ewington-Pitsos2y

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

This is dope, thank you for your service. Also, can you hit us with your code on this one? Would love to reproduce.

LESSWRONG
LW

LESSWRONG
LW

Louka Ewington-Pitsos

A suite of Vision Sparse Autoencoders

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos

A suite of Vision Sparse Autoencoders

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos

A suite of Vision Sparse Autoencoders

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos

A suite of Vision Sparse Autoencoders

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Faithful vs Interpretable Sparse Autoencoder Evals

CLIP-Scope?

Vital Statistics

TL;DR

Summary

Background

Summary