Logan Riggs

Comments

It'd be important to cache the karma of all users currently above 1000, in order to credibly signal that you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2.5 hours (i.e. the earliest we could be nuked)?
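
Something like this minimal sketch would do for the caching part; the fetch step is a placeholder, so plug in whatever API call or scrape is actually available:

```python
# Minimal sketch: snapshot {username: karma} for everyone above the threshold
# and write it to a timestamped JSON file. `get_high_karma_users` is a
# placeholder -- fill it in with whatever API call or scrape is available.
import json
import time

def get_high_karma_users(min_karma: int = 1000) -> dict[str, int]:
    """Placeholder: return {username: karma} for users above min_karma."""
    raise NotImplementedError("pull (username, karma) pairs from the site here")

def snapshot_karma(path: str = "karma_snapshot.json", min_karma: int = 1000) -> None:
    users = get_high_karma_users(min_karma)
    record = {"taken_at_unix": int(time.time()), "min_karma": min_karma, "users": users}
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

if __name__ == "__main__":
    snapshot_karma()
```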

We could instead pre-commit to not engaging with any nuker's future posts/comments (and at worst comment to encourage others not to engage) until end-of-year.

Or only leave nit-picking comments.

Could you dig into why you think it's great interp work?

But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

This paragraph sounded like you're claiming LLMs do have concepts, but that they're not located in any specific activations or weights; they're distributed across them instead.

But from your comment, it sounds like you mean that LLMs themselves don't learn the true simple-compressed features of reality, but only a mere shadow of them.

This interpretation also matches the title better!

But are you saying the "true features" are in the dataset + network? Because SAEs are trained on a dataset! (Ignoring the problem pointed out in footnote 1.)

Possibly clustering the data points by their network gradients would be a way to put some order into this mess? 

Eric Michaud did cluster datapoints by their gradients here. From the abstract:

...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). 
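
For concreteness, here is a minimal sketch of that kind of gradient clustering (not their exact method; it assumes a generic classifier `model` and labelled examples `xs`, `ys`):

```python
# Minimal sketch (not the paper's exact method): cluster examples by the
# *direction* of their per-example loss gradients. Assumes a generic
# classifier `model` and labelled examples `xs`, `ys`.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def per_example_gradient(model, x, y):
    """Flattened gradient of the loss on one (x, y) pair w.r.t. all parameters."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.flatten() for g in grads])

def cluster_by_gradients(model, xs, ys, n_clusters=10):
    gs = torch.stack([per_example_gradient(model, x, y) for x, y in zip(xs, ys)])
    gs = F.normalize(gs, dim=1)  # cosine geometry: compare gradient directions
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(gs.cpu().numpy())
```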

The one we checked last year was just Pythia-70M, and I don't expect that LLM itself to have a gender feature that generalizes to both pronouns and anisogamy.

But again, the task is next-token prediction. Do you expect, e.g., GPT-4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns, given that it was trained on next-token prediction?

Sparse autoencoders find features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.

(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of "finding features that explain next-word-prediction", which LLMs are directly trained for, SAEs find good examples![1]
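
To make that kind of check concrete, here's a sketch assuming a TransformerLens model and a trained SAE with encode/decode methods; the hook name, feature index, and the `TinySAE` stand-in are all placeholders rather than a real trained SAE:

```python
# Sketch: zero out one SAE latent at a residual-stream hook and compare
# next-token probabilities for " he" vs " she". HOOK_NAME, FEATURE_IDX, and
# TinySAE are placeholders -- swap in a real trained SAE.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")
torch.set_grad_enabled(False)
HOOK_NAME = "blocks.3.hook_resid_post"  # placeholder layer/site
FEATURE_IDX = 123                       # placeholder "gender" latent

class TinySAE(torch.nn.Module):
    """Stand-in for a trained SAE (random weights, only so the sketch runs)."""
    def __init__(self, d_model, n_latents):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_latents)
        self.dec = torch.nn.Linear(n_latents, d_model)
    def encode(self, x):
        return torch.relu(self.enc(x))
    def decode(self, acts):
        return self.dec(acts)

sae = TinySAE(model.cfg.d_model, 2048)  # replace with a trained SAE for HOOK_NAME

def ablate_feature(resid, hook):
    acts = sae.encode(resid)            # (batch, pos, n_latents)
    acts[..., FEATURE_IDX] = 0.0
    return sae.decode(acts)             # back to the residual-stream shape

def pronoun_probs(prompt, fwd_hooks=()):
    logits = model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=list(fwd_hooks))
    probs = logits[0, -1].softmax(-1)
    return {t: probs[model.to_single_token(t)].item() for t in (" he", " she")}

prompt = "My doctor said that"
print("clean:  ", pronoun_probs(prompt))
print("ablated:", pronoun_probs(prompt, [(HOOK_NAME, ablate_feature)]))
```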

I'm unsure what goal you have in mind for "features that correspond to reality", or what that'd mean.

  1. ^ Not claiming that all SAE latents are good in this way though.

Is there code available for this?

I'm mainly interested in the loss function, specifically from footnote 4:

We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity

I'm unsure how this is implemented, or what the motivation is.

Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.

@Lucius Bushnaq, why would MLPs compute linear transformations?

Since two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/attention layers that rely on this linearly transformed vector just learn the combined function?
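
The underlying point is just that linear maps compose; a toy check (plain numpy, nothing model-specific):

```python
# Two linear maps compose into a single linear map: W2 @ (W1 @ x) == (W2 @ W1) @ x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)          # residual-stream-ish input vector
W1 = rng.normal(size=(32, 16))   # a purely linear "layer"
W2 = rng.normal(size=(8, 32))    # a downstream map that consumes its output

two_step = W2 @ (W1 @ x)
combined = (W2 @ W1) @ x         # the single map a downstream layer could learn directly

assert np.allclose(two_step, combined)
```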

What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?

I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220, which suggests _post, but downloading the SAETransformer from wandb shows:
(saes): ModuleDict(
    (blocks-6-hook_resid_pre): SAE(
        (encoder): Sequential(
            (0): ...

which suggests _pre. 
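
One way to check directly (a sketch; `sae_transformer` stands for the SAETransformer you've already loaded from wandb, however you loaded it) is to list the keys of the saes ModuleDict and map them back to hook names:

```python
# Sketch: given the already-loaded SAETransformer from wandb (`sae_transformer`),
# list which hook points its SAEs attach to. ModuleDict keys can't contain dots,
# so "blocks-6-hook_resid_pre" presumably stands for "blocks.6.hook_resid_pre".
for key in sae_transformer.saes.keys():
    hook_name = key.replace("-", ".", 2)  # only the first two dashes are separators
    print(key, "->", hook_name)

# Note: in TransformerLens, blocks.{l}.hook_resid_pre holds the same residual-stream
# value as blocks.{l-1}.hook_resid_post, so the two namings can describe the same
# activations one layer apart.
```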