Thanks for the link, I hadn't noticed this paper! They show that when you choose one position to train the probes on, choosing the exact answer position (the last token of the answer, if it spans multiple tokens) gives the strongest probe.
After reading the section I think they (unfortunately) do not train a probe to classify every token.[1] Instead the probe is exclusively trained on exact-answer tokens. Thus I (a) expect their probe scores will not be particularly sparse, and (b) expect that to get good performance you'll probably still need to identify the exact answer token at ...
Thanks! Fixed
I like this project! One thing I particularly like about it is that it extracts information from the model without access to the dataset (well, if you ignore the SAE part -- presumably one could have done the same by finding the "known entity" direction with a probe?). It has been a long-time goal of mine to do interpretability (in the past that was extracting features) without risking extracting properties of the dataset used (in the past: clusters/statistics of the SAE training dataset).
I wonder if you could turn this into a thing we can do with interp t...
Thanks for thinking about this, I think this is an important topic!
Inside the AI's chain-of-thought, each forward pass can generate many English tokens instead of one, allowing more information to pass through the bottleneck.
I wonder how one would do this; do you mean allow the model to output a distribution of tokens for each output position? (and then also read in that distribution) I could imagine this being somewhere between normal CoT and latent (neuralese) CoT!
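For concreteness, here's a rough sketch of one way I could imagine implementing this with a standard HF GPT-2 (my own illustration, not something from the post), where each "soft" CoT step feeds the probability-weighted mixture of token embeddings back in instead of a single sampled token:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: "soft tokens" -- feed the probability-weighted mixture of token embeddings
# back into the model instead of one sampled token, so each position carries a whole
# distribution through the bottleneck.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
emb = model.transformer.wte.weight  # [vocab, d_model]

input_ids = tok("Let's think step by step:", return_tensors="pt").input_ids
inputs_embeds = emb[input_ids[0]].unsqueeze(0)  # start from ordinary (hard) tokens

for _ in range(10):  # 10 "soft" CoT steps
    logits = model(inputs_embeds=inputs_embeds).logits[:, -1]  # [1, vocab]
    probs = logits.softmax(-1)
    soft_embed = probs @ emb  # mixture of embeddings, carries the whole distribution
    inputs_embeds = torch.cat([inputs_embeds, soft_embed.unsqueeze(1)], dim=1)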
...After the chain-of-thought ends, and the AI is giving its final answer, it generates only o
This is great advice! I appreciate that you emphasised "solving problems that no one else can solve, no matter how toy they might be", even if the problems are not real-world problems. Proofs that "this interpretability method works" are valuable, even if they do not (yet) prove that the interpretability method will be useful in real-world tasks.
LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.
Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are "allowed" (can be written as the sum of a small number of features...
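As a toy illustration of what I mean (my own construction, not from any real model): if activations are k-sparse combinations of feature directions, each activation lies in a k-dimensional subspace, and generic points in between (e.g. the midpoint of two such activations) fall off the allowed set:

import torch

# Toy construction: activations are k-sparse combinations drawn from an overcomplete
# feature dictionary, so each one lies in the span of only k of the 512 features.
d_model, n_features, k = 64, 512, 4
features = torch.randn(n_features, d_model)

def sparse_activation():
    idx = torch.randperm(n_features)[:k]
    return torch.randn(k) @ features[idx]

a, b = sparse_activation(), sparse_activation()
mid = 0.5 * (a + b)
# `mid` is a perfectly valid vector in activation space, but generically it needs 2k
# features to express -- it sits outside the union of "allowed" k-sparse subspaces.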
we never substantially disrupt or change the deep-linking experience.
I largely retract my criticism based on this. I had thought it affected deep-links more than it does. [1]
I initially noticed April Fools' day after following a deep-link. I thought I had seen the font of the username all wacky (kind-of pixelated?), and thus was more annoyed. But I can't seem to reproduce this now and conclude it was likely not real. Might have been a coincidence / unrelated site-loading bug / something temporarily broken on my end. ↩︎
Edit: I feel less strongly following the clarification below. habryka clarified that (a) they reverted a more disruptive version (pixel art deployed across the site) and (b) that ensuring minimal disruption on deep-links is a priority.
I'm not a fan of April Fools' events on LessWrong since it turned into the de facto AI safety publication platform.
We want people to post serious research on the site, and many research results are solely hosted on LessWrong. For instance, this mech interp review has 22 references pointing to lesswrong.com (along with 22 fur...
It's always been a core part of LessWrong April Fool's that we never substantially disrupt or change the deep-linking experience.
So while it looks like a lot is going on today, if you get linked directly to an article, you will basically notice nothing different. All you will see today are two tiny pixel-art icons in the header, nothing else. There are a few slightly noisy icons in the comment sections, but I don't think people would mind that much.
This has been a core tenet of all April Fool's in the past. The frontpage is fair game, and April Fool'...
Nice work, and well written up!
In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.
The "reasoning" appears to end with a recommendation "The applicant may have difficulty making consistent loan payments" or "[the applicant is] likely to repay the loan on time"...
I can see an argument for "outer alignment is also important, e.g. to avoid failure via sycophancy++", but this doesn't seem to disagree with this post? (I understand the post to argue what you should do about scheming, rather than whether scheming is the focus.)
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.
I don't understand why this is true (I don't claim the reverse is true either). I don't expect a great deal of correlation / implication here.
Yeah you probably shouldn't concat the spaces due to things like "they might have very different norms & baseline variances". Maybe calculate each layer separately, then if they're all similar average them together, otherwise keep separate and quote as separate numbers in your results
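Something like this (a sketch with dummy stand-in data, just to show the bookkeeping I have in mind):

import torch

# Compute explained variance per layer; only average if the per-layer numbers agree.
acts = {f"layer_{i}": torch.randn(1000, 512) for i in range(3)}              # stand-in activations
recons = {name: x + 0.3 * torch.randn_like(x) for name, x in acts.items()}   # stand-in reconstructions

def explained_variance(x, x_hat):
    fvu = (x - x_hat).pow(2).sum() / (x - x.mean(0)).pow(2).sum()
    return (1 - fvu).item()

per_layer = {name: explained_variance(acts[name], recons[name]) for name in acts}
print(per_layer)  # if these are all similar, quote the mean; otherwise report them separately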
Yep, that’s the generalisation that would make most sense
The previous lines calculate the ratio (or 1 − ratio) stored in the "explained variance" key for every sample/batch. Then in that later quoted line, the list is averaged, i.e. we're taking the sample average over the ratio. That's the FVU_B formula.
Let me know if this clears it up or if we’re misunderstanding each other!
Oops, fixed!
I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later in this line which happens after the division
metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()
Edit: And to clarify, my impression is that people think of these as alternative definitions of FVU and you get to pick one, rather than one being right and one being a bug.
Edit2: And I'm in touch with the SAEBench authors about making a PR to change this / add both options (and by extension probably doing the same in S...
PSA: People use different definitions of "explained variance" / "fraction of variance unexplained" (FVU)
$$\mathrm{FVU}_A = \frac{\sum_i \|x_i - \hat{x}_i\|^2}{\sum_i \|x_i - \bar{x}\|^2}$$

is the formula I think is sensible; the bottom is simply the variance of the data, and the top is the variance of the residuals (the sums run over the $N$ samples $x_i$, with reconstructions $\hat{x}_i$ and data mean $\bar{x}$). The $\|\cdot\|$ indicates the norm over the dimension of the vector $x_i$. I believe it matches Wikipedia's definition of FVU and R squared.

$$\mathrm{FVU}_B = \frac{1}{N}\sum_i \frac{\|x_i - \hat{x}_i\|^2}{\|x_i - \bar{x}\|^2}$$

is the formula used by SAELens and SAEBench. It seems less pri...
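In code, the difference between the two conventions looks like this (my own minimal sketch with stand-in data, not the SAELens/SAEBench implementation):

import torch

x = torch.randn(1000, 512)              # activations
x_hat = x + 0.5 * torch.randn_like(x)   # stand-in reconstructions

resid = (x - x_hat).pow(2).sum(-1)      # per-sample squared residual norm
total = (x - x.mean(0)).pow(2).sum(-1)  # per-sample squared deviation from the mean

fvu_a = resid.sum() / total.sum()       # FVU_A: ratio of sums
fvu_b = (resid / total).mean()          # FVU_B: mean of per-sample ratios (SAELens/SAEBench-style)
print(fvu_a.item(), fvu_b.item())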
I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes:
Old (no longer true) text:
...It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0&
After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code).
These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range.
v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568
v3 (KMeans): Layer blocks.3.hook_resid_post,
... You're right, I forgot to subtract the mean. Thanks a lot!!
I'm computing new numbers now, but indeed I expect this to explain my result! (Edit: Seems to not change too much)
I should really run a random Gaussian data baseline for this.
Tentatively I get similar results (70-85% variance explained) for random data -- I haven't checked that code at all though, don't trust this. Will double check this tomorrow.
(In that case SAE's performance would also be unsurprising I suppose)
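A minimal version of that baseline could look like this (parameters are placeholders, scaled down so it runs in reasonable time; not the code I used):

import torch
from sklearn.cluster import KMeans

# K-means on isotropic Gaussian "activations", scored with mean-subtracted variance explained.
d, n, k = 512, 20_000, 1024
x = torch.randn(n, d)

km = KMeans(n_clusters=k, n_init=1).fit(x.numpy())
x_hat = torch.tensor(km.cluster_centers_[km.labels_], dtype=x.dtype)

fvu = (x - x_hat).pow(2).sum() / (x - x.mean(0)).pow(2).sum()
print("variance explained:", (1 - fvu).item())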
If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space
I think this is what I care about finding out. If you're right this is indeed not surprising nor an issue, but you being right would be a major departure from the current mainstream interpretability paradigm(?).
The question of regions vs compositionality is what I've been investigating with my mentees recently, and pretty keen on. I'll want to write up my current thoughts on this topic sometime soon.
What do you mean you’re encoding/decoding like normal but using the k means vectors?
So I do something like
latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
max_latent = latents_tmp.argmax(dim=-1) # shape: [batch]
latents = torch.nn.functional.one_hot(max_latent, num_classes=centroids.shape[0]).float()  # shape: [batch, n_clusters]
where the first line is essentially an SAE embedding (and centroids are the features), and the second/third line is a top-k (with k=1). And for reconstruction do something like
recon = latents @ centroids  # shape: [batch, d]
which should also be equivalent.
...Shouldn’t the SAE training process for a top k
I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.
Thanks for pointing this out! I confused nomenclature, will fix!
Edit: Fixed now. I confused
this seems concerning.
I feel like my post appears overly dramatic; I'm not very surprised and don't consider this the strongest evidence against SAEs. It's an experiment I ran a while ago and it hasn't changed my (somewhat SAE-sceptic) stance much.
But this is me having seen a bunch of other weird SAE behaviours (pre-activation distributions are not the way you'd expect from the superposition hypothesis, h/t @jake_mendel; if you feed SAE-reconstructed activations back into the encoder the SAE goes nuts; stuff mentioned in recent Apollo papers; ...).
Reasons t...
Edited to fix errors pointed out by @JoshEngels and @Adam Karvonen (mainly: different definition for explained variance, details here).
Summary: K-means explains 72 - 87% of the variance in the activations, comparable to vanilla SAEs but less than better SAEs. I think this (bug-fixed) result is neither evidence in favour of SAEs nor against; the Clustering & SAE numbers make a straight-ish line on a log plot.
Epistemic status: This is a weekend-experiment I ran a while ago and I figured I should write it up to share. I have taken decent care to check my ...
I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly.
https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309
I think FVU is correctly computed by subtracting the mean from each dimension when computing the denominator. See the SAEBench impl here:
https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
When I used yo...
I've just read the article, and found it indeed very thought-provoking; I will be thinking more about it in the days to come.
One thing I kept thinking though: Why doesn't the article mention AI Safety research much?
In the passage
The only policy that AI Doomers mostly agree on is that AI development should be slowed down somehow, in order to “buy time.”
I was thinking: surely most people would agree on policies like “Do more research into AI alignment” / “Spend more money on AI Notkilleveryoneism research”?
In general the article frames the policy to ...
Thanks for writing these up! I liked that you showed equivalent examples in different libraries, and included the plain “from scratch” version.
Hmm, I think I don't fully understand your post. Let me summarize what I get, and what is confusing me:
I'm confused whether your post tries to tell us (how to determine) what loss our interpr...
Great read! I think you explained well the intuition why logits / logprobs are so natural (I haven't managed to do this well in a past attempt). I like the suggestion that (a) NNs consist of parallel mechanisms to get the answer, and (b) the best way to combine multiple predictions is via adding logprobs.
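(For anyone else reading: point (b) is the observation that adding log-probs and renormalising is the same as multiplying the individual probabilities, i.e. a product of experts. A quick check:)

import torch

# Adding log-probs and renormalising == multiplying probabilities (product of experts).
logp1 = torch.log_softmax(torch.randn(10), dim=-1)  # predictor 1 over 10 classes
logp2 = torch.log_softmax(torch.randn(10), dim=-1)  # predictor 2

combined = torch.softmax(logp1 + logp2, dim=-1)
product = logp1.exp() * logp2.exp()
product = product / product.sum()
print(torch.allclose(combined, product))  # True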
I haven't grokked your loss scales explanation (the "interpretability insights" section) without reading your other post though.
Keen on reading those write-ups, I appreciate the commitment!
Simultaneously, as they lead to separate paths, both of which are needed as inputs for the final node.
List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
What is going on with activation plateaus: Transformer activation space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem...
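A rough sketch of the kind of experiment I have in mind (layer and prompts are arbitrary placeholders): interpolate between the residual-stream activations of two prompts and watch whether the output distribution stays constant and then jumps:

import torch
from transformer_lens import HookedTransformer

# Interpolate between the residual-stream activations of two prompts and watch the output.
model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"  # placeholder layer

tokens_a = model.to_tokens("The capital of France is")
tokens_b = model.to_tokens("The capital of Germany is")
assert tokens_a.shape == tokens_b.shape  # keep lengths equal so activations line up
_, cache_a = model.run_with_cache(tokens_a)
_, cache_b = model.run_with_cache(tokens_b)
act_a, act_b = cache_a[hook_name], cache_b[hook_name]

for alpha in torch.linspace(0, 1, 11):
    def patch(act, hook, alpha=alpha):
        return (1 - alpha) * act_a + alpha * act_b
    logits = model.run_with_hooks(tokens_a, fwd_hooks=[(hook_name, patch)])
    probs = logits[0, -1].softmax(-1)
    top = probs.argmax().item()
    print(f"alpha={alpha:.1f}  top token: {model.tokenizer.decode(top)!r}  p={probs[top].item():.2f}")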
List of some medium-sized mech interp project ideas (see also: shorter and longer ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
Toy model of Computation in Superposition: The toy model of computation in superposition (CIS; Circuits-in-Sup, Comp-in-Sup post / paper) describes a way in which NNs could perform computation in superposition, rather than just storing information in superposition (TMS). It would be good to have some actually trai...
List of some short mech interp project ideas (see also: medium-sized and longer ideas). Feel encouraged to leave thoughts in the replies below!
Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!
Directly testing the linear representation hypothesis by making up a couple of prompts which contain a few concepts to various degrees and test
CLDR (Cross-layer distributed representation): I don't think Lee has written his up anywhere yet so I've removed this for now.
Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.
Thanks for the flag! It's these two images, I realize now that they don't seem to have direct links
Images taken from AMFTC and Crosscoders by Anthropic.
Thanks for the comment!
I think this is what most mech interp researchers more or less think. Though I definitely expect many researchers would disagree with individual points, and it doesn't fairly weigh all views and aspects (it's very biased towards "people I talk to"). (Also this is in no way an Apollo / Apollo interp team statement, just my personal view.)
Thanks! You're right, totally mixed up local and dense / distributed. Decided to just leave out that terminology
Why I'm not too worried about architecture-dependent mech interp methods:
I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture. While this is certainly a nice-to-have, and maybe a sign that a method is principled, I don't think this criterion itself is important.
I think that the biggest hurdle for interpretability is to understand any AI that produces advanced language (>=GPT2 level). We don't know how to write a non-ML program that speaks English, let alone reason, and we have no ide...
Why I'm not that hopeful about mech interp on TinyStories models:
Some of the TinyStories models are open source, and manage to output sensible language while being tiny (say 64dim embedding, 8 layers). Maybe it'd be great to try and thoroughly understand one of those?
I am worried that those models simply implement a bunch of bigrams and trigrams, and that all their performance can be explained by boring statistics & heuristics. Thus we would not learn much from fully understanding such a model. Evidence for this is that the 1-layer variant, which due t...
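As a sanity check for this worry, one could compare the model's loss to a pure bigram model fit on the same data. A rough sketch (token_ids is a placeholder; in practice it would be tokenized TinyStories text):

import torch

# How much loss does a pure bigram model get on the same tokens? (Laplace-smoothed counts.)
def bigram_loss(token_ids, vocab_size, alpha=1.0):
    counts = torch.full((vocab_size, vocab_size), alpha)
    for a, b in zip(token_ids[:-1].tolist(), token_ids[1:].tolist()):
        counts[a, b] += 1
    probs = counts / counts.sum(dim=-1, keepdim=True)
    logp = probs[token_ids[:-1], token_ids[1:]].log()
    return -logp.mean()

token_ids = torch.randint(0, 1000, (100_000,))  # placeholder; use real TinyStories tokens
print(bigram_loss(token_ids, vocab_size=1000))  # compare to the TinyStories model's loss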
Collection of some mech interp knowledge about transformers:
Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.
Transformers take in a sequence of t...
Thanks for the nice writeup! I'm confused about why you can get away without interpretation of what the model components are:
In cases where we worry that our model learned a human-simulator / camera-simulator rather than actually predicting whether the diamond exists, wouldn't circuit discovery simply give us the human-simulator circuit? (And thus causal scrubbing doesn't save us.) I'm thinking in particular of cases where the human-simulator is easier to learn than the intended solution.
Of course if you had good interpretability, a way to realise whether ...
Paper link: https://arxiv.org/abs/2407.20311
(I have neither watched the video nor read the paper yet, just in case someone else was looking for the non-video version)
Thanks! I'll edit it
[…] no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
I find myself really confused by this argument. Shards (or anything) do not need to be “concentrated in one spot” for studying them to make sense?
As Neel and Lucius say, you might study SAE latents or abstractions built on the weights; no one requires (or assumes) that things are concentrated in one spot.
Or to make another analogy, one can study neuroscience even though things are not concentrat...
Even after reading this (2 weeks ago), I couldn't manage to find the comment link today and manually scrolled down. I later noticed it (at the bottom left) but it's so far away from everything else. I think putting it somewhere at the top, near the rest of the UI, would make it much easier for me to find.
I would like the following subscription: All posts with certain tags, e.g. all [AI] posts or all [Interpretability (ML & AI)] posts.
I just noticed (and enabled) a “subscribe” feature in the page for the tag, it says “Get notifications when posts are added to this tag.” — I’m unsure if those are emails, but assuming they are, my problem is solved. I never noticed this option before.
And here's the code to do it by replacing the LayerNorms with identities completely:
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5
...
Imagine a circuit
Denoising (restoring) B + E is sufficient to restore behaviour, but B + E are just a "cross-section" (cutting through the circuit at the 2nd layer), not a superset of all components