All of tdooms's Comments + Replies

tdooms

We haven't considered this since our idea was that the encoder could maybe use the full information to better predict features. However, this seems worthwhile to at least try. I'll look into this soon, thanks for the inspiration.
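To make this concrete, here is a minimal sketch of the two encoder-input variants as I read the suggestion (illustrative names and shapes, not our actual code):

```python
import torch

def encode_full(x, W_enc, b_enc, b_dec):
    # Current idea: the encoder sees the full residual-stream vector,
    # so it can use the token information to better predict features.
    return (x - b_dec) @ W_enc.T + b_enc

def encode_minus_token(x, token_ids, W_lookup, W_enc, b_enc, b_dec):
    # Suggested variant (as I understand it): subtract the token's lookup row first,
    # so the encoder only has to model the non-token part of the signal.
    return (x - W_lookup[token_ids] - b_dec) @ W_enc.T + b_enc
```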

tdooms

This is a completely fair suggestion. I'll look into training a fully-fledged SAE with the same number of features for the full training duration. 

tdooms

One caveat I want to highlight: there was a bug when training the tokenized SAEs for the expansion sweep, so the lookup table wasn't learned and remained at the hard-coded values...

They are therefore quite suboptimal. Due to some compute constraints, I haven't re-run that experiment (the 64x SAEs take quite a while to train).
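For clarity, the difference comes down to whether the lookup table is registered as a trainable parameter or stays frozen at its initialization; roughly (an illustrative sketch, not the actual training code):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 768, 50257                      # GPT-2 sizes, for illustration
unigram_residuals = torch.randn(vocab_size, d_model)  # stand-in for the precomputed unigram residuals

# What the buggy sweep effectively did: the table keeps its hard-coded initial values.
lookup_frozen = nn.Embedding.from_pretrained(2.0 * unigram_residuals, freeze=True)

# What was intended: same initialization, but the table is updated during training.
lookup_learned = nn.Embedding.from_pretrained(2.0 * unigram_residuals, freeze=False)
```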

Anyway, I think the main question you want answered is whether the 8x tokenized SAE beats the 64x normal SAE, which it does. The 64x SAE is improving slightly quicker near the end of training; I only used 130M tokens.
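(For reference, a minimal sketch of NMSE as I use it here, assuming the usual squared-error-over-squared-norm definition; the exact normalization in the plots may differ slightly:)

```python
import torch

def nmse(x, x_hat):
    # Reconstruction error normalized by the magnitude of the inputs.
    return ((x - x_hat) ** 2).sum() / (x ** 2).sum()
```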

Below is an NMSE pl...

Logan Riggs
That's great, thanks!

My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn't want to run because you've already done quite a bit of work on this project, lol): compare

1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token adds 1 extra feature)

for 300M tokens (I usually don't see improvements past this amount), showing NMSE and CE.

If tokenized-SAEs are still better in this experiment, then that's a pretty solid argument to use these! If they're equivalent, then tokenized-SAEs are still way faster to train in this lower expansion range, while having 50k "features" already interpreted. If tokenized-SAEs are worse, then these tokenized features aren't a good prior to use.

Although both sets of features are learned, the difference would be that the tokenized SAE always has the same feature per token (duh), while baseline SAEs allow whatever combination of features (e.g. features shared across different tokens).
tdooms

That's awesome to hear. While we are not especially familiar with circuit analysis, anecdotally we've heard that some circuit features are very disappointing (such as the "Mary" feature for IOI; I believe this is also the case in Othello SAEs, where many features just describe the last move). This was a partial motivation for this work.

About similar tokenized features: maybe I'm misunderstanding, but this seems like a problem for any decoder-like structure. In the lookup table, though, I think this behaviour is somewhat attenuated by the strict manual trigger, which encourages the lookup table to learn exact features instead of means.
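To illustrate what I mean by the strict manual trigger, a rough sketch of the decoder side (not our exact implementation):

```python
import torch

def tokenized_decode(topk_acts, token_ids, W_dec, b_dec, W_lookup):
    # The lookup row is selected by the token id itself, not by anything the encoder
    # computes, so it fires on exactly that token and has less incentive to drift
    # towards a mean over loosely related inputs.
    return topk_acts @ W_dec + W_lookup[token_ids] + b_dec
```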

Logan Riggs
I didn't mean to imply it's a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim = 1 in the tokenized-feature set, and we find a downstream feature reading from the " 9" token on a specific task, then we should conclude it's reading from a more general number direction rather than a specific number direction. I agree this argument also applies to the normal SAE decoder (if the cos-sim = 1).
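(As a concrete check, one could measure this directly on the lookup table; a quick sketch with hypothetical token ids:)

```python
import torch
import torch.nn.functional as F

def lookup_cos_sims(W_lookup, token_ids):
    # Pairwise cosine similarities between the lookup rows of the given tokens
    # (e.g. the ids of " 0" ... " 9"); values near 1 suggest a shared "number" direction.
    rows = F.normalize(W_lookup[token_ids], dim=-1)
    return rows @ rows.T
```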
tdooms

We used a Google Drive folder where we stored most of the runs (https://drive.google.com/drive/folders/1ERSkdA_yxr7ky6AItzyst-tCtfUPy66j?usp=sharing). We use a somewhat weird naming scheme: if there is a "t" in the postfix of the name, it is tokenized. Some runs may be old and may not fully work; if you run into any issues, feel free to reach out.

The code in the research repo (specifically https://github.com/tdooms/tokenized-sae/blob/main/base.py#L119) should work to load them in.

Please keep in mind that these are currently more of a proof of concept and are like...

danwil
Additionally:

* We recommend using the gpt2-layers directory, which includes resid_pre layers 5-11, topk=30, 12288 features (the tokenized "t" ones have learned lookup tables, pre-initialized with unigram residuals).
* The folders pareto-sweep, init-sweep, and expansion-sweep contain parameter sweeps, with lookup tables fixed to 2x unigram residuals.

In addition to the code repo linked above, for now here is some quick code that loads the SAE, exposes the lookup table, and computes activations only:

```python
import torch

SAE_BASE_PATH = 'gpt2-layers/gpt2_resid_pre'  # hook_pt = 'resid_pre'
layer = 8

state_dict = torch.load(f'{SAE_BASE_PATH}_{layer}_ot30.pt')  # o=topk, t=tokenized, k=30
W_lookup = state_dict['lookup.W_lookup.weight']  # per-token decoder biases

def tokenized_activations(state_dict, x):
    'Return pre-relu activations. For simplicity, does not limit to topk.'
    return (x - state_dict['b_dec']) @ state_dict['W_enc'].T + state_dict['b_enc']
```
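A quick usage sketch (the random input is a stand-in for real GPT-2 resid_pre activations; shapes assume d_model=768 and the 12288 features mentioned above):

```python
# Illustrative only: random vectors in place of real residual-stream activations.
x = torch.randn(4, 768)                      # (n_tokens, d_model)
acts = tokenized_activations(state_dict, x)  # (n_tokens, 12288) pre-relu activations
print(acts.shape)
```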
tdooms

Tokens are indeed only a specific instantiation of hardcoding "known" features into an SAE; there are lots of interesting sparse features one could consider, which may speed up training even further.
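As a sketch of the more general picture (illustrative, not our exact formulation): the reconstruction is the usual learned-dictionary term plus a contribution from features whose activations are hardcoded by a cheap deterministic rule; the token lookup is the special case where that rule is a one-hot over the vocabulary.

```python
import torch

def decode_with_hardcoded_features(learned_acts, W_dec, b_dec, trigger_acts, W_hard):
    # learned_acts: encoder-produced sparse activations over the learned dictionary W_dec.
    # trigger_acts: activations of "known" features computed by a fixed rule (e.g. a
    # one-hot over token ids, or any other cheap sparse detector), paired with W_hard.
    return learned_acts @ W_dec + trigger_acts @ W_hard + b_dec
```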

I like the suggestion of trying to find the "enriched" token representations. While our work shows that such representations are likely bigrams and trigrams, using an extremely sparse SAE to reveal those could also work (say at layer 1 or 2). While this approach still has the drawback of having an encoder, this encoder can be shared across SAEs, which is still a l...