CLIP-Scope: Inspired by Gemma Scope, we trained 8 Sparse Autoencoders, each on 1.2 billion tokens, on different layers of a Vision Transformer. These (along with more technical details) can be accessed on Hugging Face. We also released a PyPI package for ease of use. We hope that this will allow researchers to...
TL;DR A Sparse Autoencoder trained on CLIP activations detects features in YouTube thumbnails associated with high view counts, across a dataset of 524,288 thumbnails from YouTube-8M. Most of these features are interpretable and monosemantic. The most strongly associated feature has a median view count > 6.5x the dataset median. After...
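The association metric in the summary can be illustrated with a small sketch. Everything here is synthetic and hypothetical (random activations and log-normal view counts standing in for the real SAE features and YouTube-8M data); it only shows how a "median view count relative to the dataset median" number is computed for one feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one SAE feature's activation per thumbnail,
# and each thumbnail's view count (log-normal, as view data tends to be).
n = 10_000
feature_acts = rng.random(n)
view_counts = np.exp(rng.normal(10.0, 2.0, size=n))

# Treat the top 1% of thumbnails by activation as "feature-active".
top = feature_acts >= np.quantile(feature_acts, 0.99)

# Association metric: median view count of feature-active thumbnails,
# relative to the dataset-wide median (the article reports > 6.5x for
# its strongest feature; random data gives ~1x).
ratio = np.median(view_counts[top]) / np.median(view_counts)
print(f"median view-count ratio: {ratio:.2f}")
```

With real data, this ratio is computed per feature and the features are then ranked by it.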
This article is a footnote to the (excellent) Tokenized SAEs. To briefly summarize: the authors (Thomas Dooms and Daniel Wilhelm) start by proposing the idea of a token "unigram". Let U_t be the unigram for a specific token t. Conceptually there is a unique U_t for each layer M_i in...
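A minimal sketch of the unigram idea, using a toy numpy "layer" rather than a real transformer (the weights and dimensions below are made up for illustration): the unigram U_t for a layer is the activation that layer produces when the model sees token t on its own, so there is one U_t per (token, layer) pair.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 64, 16

# Toy stand-ins for an embedding table and one MLP layer M_i.
W_embed = rng.normal(size=(vocab_size, d_model))
W_in = rng.normal(size=(d_model, 4 * d_model))
W_out = rng.normal(size=(4 * d_model, d_model))

def layer(x: np.ndarray) -> np.ndarray:
    """One toy MLP layer (ReLU nonlinearity for simplicity)."""
    return np.maximum(x @ W_in, 0.0) @ W_out

# U_t: run every token t through the layer *in isolation* and record
# the output. Row t of `unigrams` is the unigram for token t.
unigrams = layer(W_embed)   # shape (vocab_size, d_model)
U_t = unigrams[42]          # unigram for token id 42 at this layer
```

In a real model one would cache these per layer; the point is only that U_t depends on the token alone, not on any surrounding context.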
Summary 300 million GPT2-small activations are cached on S3; we pull these very quickly onto a g4dn.8xlarge EC2 instance in the same region and use them to train a 24,576-dimensional Switch Sparse Autoencoder in 26 minutes (excluding generation time). We achieve a similar L0/reconstruction loss to Gao et al. and...
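For readers unfamiliar with the Gao et al. setup being compared against, here is a toy TopK-style SAE forward pass in numpy. This is not the Switch variant or the authors' code; the dimensions are shrunk (the article's SAE has 24,576 latents) and the weights are random, purely to show how the k-sparse bottleneck produces the L0/reconstruction trade-off mentioned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 64, 512, 32   # toy sizes; article uses d_sae = 24,576

# Random toy SAE weights (an encoder/decoder pair).
W_enc = rng.normal(scale=0.01, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.01, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def topk_sae(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Encode, keep only the k largest latents per example, decode."""
    pre = x @ W_enc + b_enc
    kth = np.partition(pre, -k, axis=-1)[..., -k, None]  # k-th largest value
    z = np.where(pre >= kth, pre, 0.0)                   # exactly k survive
    x_hat = z @ W_dec                                    # reconstruction
    return z, x_hat

x = rng.normal(size=(4, d_model))      # a batch of cached activations
z, x_hat = topk_sae(x)
l0 = (z != 0).sum(axis=-1).mean()      # equals k by construction
```

Because the TopK nonlinearity fixes L0 at exactly k, comparisons against Gao et al. reduce to matching reconstruction loss at the same k.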
Summary Sparse Autoencoders (SAEs) have become a popular tool in Mechanistic Interpretability, but at the time of writing there is no standard, automated way to accurately assess the quality of an SAE. In practice, many different metrics are applied and interpreted to assess SAE quality in different studies. Conceptually, however...
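Two of the metrics most commonly reported in practice are L0 (mean number of active latents per example) and a reconstruction score such as the fraction of variance unexplained (FVU). A minimal sketch of both, on synthetic data (the arrays below are stand-ins, not outputs of a real SAE):

```python
import numpy as np

def l0(z: np.ndarray) -> float:
    """Mean number of non-zero SAE latents per example."""
    return float((z != 0).sum(axis=-1).mean())

def fvu(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Fraction of variance unexplained: residual SS / total SS."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(resid / total)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))              # stand-in model activations
x_hat = x + 0.1 * rng.normal(size=x.shape)   # a good fake reconstruction
print(l0(x), fvu(x, x_hat))                  # FVU is ~0.01 here
```

The tension the summary alludes to is that these two numbers trade off against each other, so neither alone pins down SAE quality.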