It's essentially training an SAE on the concatenation of the residual stream from the base model and the chat model. So, for each prompt, you run it through the base model to get a residual stream vector v_b, through the chat model to get a residual stream vector v_c, and then concatenate these to get a vector twice as long, and train an SAE on this (with some minor additional details that I'm not getting into)

Reply

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda1moΩ230

This is somewhat similar to the approach of the ROME paper, which has been shown to not actually do fact editing, just inserting louder facts that drown out the old ones and maybe suppressing the old ones.

In general, the problem with optimising model behavior as a localisation technique is that you can't distinguish between something that truly edits the fact, and something which adds a new fact in another layer that cancels out the first fact and adds something new.

Reply

How I got 4.2M YouTube views without making a single video

Neel Nanda1mo30

Agreed, chance of success when cold emailing busy people is low, and spamming them is bad. And there are alternate approaches that may work better, depending on the person and their setup - some Youtubers don't have a manager or employees, some do. I also think being able to begin an email with "Hi, I run the DeepMind mechanistic interpretability team" was quite helpful here.

Reply

Mark Xu's Shortform

Neel Nanda1moΩ220

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

Reply

MichaelDickens's Shortform

Neel Nanda2mo110

Huh, are there examples of right leaning stuff they stopped funding? That's new to me

Reply

Nathan Young's Shortform

Neel Nanda2mo72

+1. Concretely this means converting every probability p into p/(1-p), and then multiplying those (you can then convert back to probabilities)

Intuition pump: Person A says 0.1 and Person B says 0.9. This is symmetric, if we instead study the negation, they swap places, so any reasonable aggregation should give 0.5

Geometric mean does not, instead you get 0.3

Arithmetic gets 0.5, but is bad for the other reasons you noted

Geometric mean of odds is sqrt(1/9 * 9) = 1, which maps to a probability of 0.5, while also eg treating low probabilities fairly

Reply

1

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Neel Nanda2moΩ570

Interesting thought! I expect there's systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentived to learn features which tend to split a lot, I think, as then they're useful for more predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAEs decoder. Re why meta SAEs are interesting at all, they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too"

Reply

quetzal_rainbow's Shortform

Neel Nanda2mo40

What's wrong with twitter as an archival source? You can't edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

Reply

Why I'm bearish on mechanistic interpretability: the shards are not in the network

Neel Nanda2mo113

To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?

If so, how do you square that with eg all the abstract features Anthropic found in Sonnet 3?

Reply

Contra papers claiming superhuman AI forecasting

Neel Nanda2mo30

Thanks for making the correction!

Reply