Jack Foxabbott

Hey, really enjoyed this post, thanks! Did you consider using a binary codebook, i.e. a set of vectors [b_0, ..., b_k] where each b_i is binary? This would give the latent space more structure and might endow each dimension of the codes with its own meaning, so we could get away with interpreting dimensions rather than full codes, more along the lines of how SAE latent variables are interpreted.
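
To make the suggestion concrete, here's a minimal sketch of the kind of quantiser I have in mind (the class name and shapes are mine, purely illustrative): each dimension is binarised independently with a straight-through estimator, so the implicit codebook is the set of 2^d hypercube corners and each dimension can be read off on its own.

```python
# Minimal sketch of a binary codebook: binarise each latent dimension
# to {-1, +1} with a straight-through estimator, so the "codebook" is
# the set of 2^d hypercube corners rather than a learned table.
import torch
import torch.nn as nn

class BinaryQuantizer(nn.Module):
    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, d) encoder output
        z_q = torch.sign(z)
        z_q = torch.where(z_q == 0, torch.ones_like(z_q), z_q)  # break ties at 0
        # straight-through: quantised values forward, identity gradient backward
        return z + (z_q - z).detach()

z = torch.randn(8, 16, requires_grad=True)  # stand-in for encoder output
codes = BinaryQuantizer()(z)                # entries in {-1, +1}, gradients still flow
```

You note in the post: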

There are notoriously a lot of tricks involved in training a VQ-VAE. For instance:

  • Using a lower codebook dimension
  • Normalising the codes and the encoded vectors (this paper claims that forcing the vectors to be on a hypersphere improves code usage)
  • Expiring stale codes
  • Forcing the codebook to be orthogonal, meaning translation equivariance of the codes
  • Various additional losses

Do you think this would intrinsically make a binary version hard to train? In some toy experiments with synthetic data I'm finding the codebook underutilised. (I've now realised FSQ may solve this problem.)
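
For concreteness, the reason I think FSQ (Mentzer et al. 2023, "Finite Scalar Quantization: VQ-VAE Made Simple") might help is that it has no learned codebook left to collapse: each dimension is bounded and rounded to a small fixed set of levels, so the implicit codebook is a grid that every encoder output snaps onto. A minimal sketch (level counts are illustrative; odd counts keep the grid symmetric, and the paper handles even counts with a half-step offset):

```python
# Minimal FSQ sketch: bound each latent dimension, round it to a fixed
# number of integer levels, and pass gradients straight through.
import torch
import torch.nn as nn

class FSQ(nn.Module):
    def __init__(self, levels: list[int]):
        super().__init__()
        # half-width of the grid per dimension, e.g. 5 levels -> [-2, 2]
        self.register_buffer("half", (torch.tensor(levels) - 1) / 2)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, d) with d == len(levels)
        z = torch.tanh(z) * self.half   # bound dimension i to [-half_i, half_i]
        z_q = torch.round(z)            # snap to the nearest integer level
        # straight-through: quantised values forward, identity gradient backward
        return z + (z_q - z).detach()

quantizer = FSQ(levels=[7, 5, 5, 5])    # implicit codebook of 7*5*5*5 codes
codes = quantizer(torch.randn(4, 4))
```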