All of StefanHex's Comments + Replies

The previous lines calculate the ratio (or 1 - ratio) stored in the “explained variance” key for every sample/batch. Then, in that later quoted line, the list is averaged, i.e. we're taking the sample average over the ratio. That's the FVU_B formula.

Let me know if this clears it up or if we’re misunderstanding each other!

I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later, in this line, which happens after the division:

        metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()
1Archimedes
Let's suppose that's the case. I'm still not clear on how are you getting to FVU_B?

I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later, in this line, which happens after the division:

        metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()

Edit: And to clarify, my impression is that people think of this as alternative definitions of FVU and you got to pick one, rather than one being right and one being a bug.

Edit2: And I'm in touch with the SAEBench authors about making a PR to change this / add both options (and by extension probably doing the same in S... (read more)

4Gurkenglas
Ah, oops. I think I got confused by the absence of L_2 syntax in your formula for FVU_B. (I agree that FVU_A is more principled ^^.)
StefanHex*272

PSA: People use different definitions of "explained variance" / "fraction of variance unexplained" (FVU)

$$\mathrm{FVU}_A = \frac{\sum_{i} \lVert x_i - \hat{x}_i \rVert^2}{\sum_{i} \lVert x_i - \bar{x} \rVert^2}$$

is the formula I think is sensible (where $x_i$ are the original activations, $\hat{x}_i$ the reconstructions, and $\bar{x}$ the mean activation); the bottom is simply the variance of the data, and the top is the variance of the residuals. The $\lVert\cdot\rVert$ indicates the $L_2$ norm over the dimension of the vector $x_i$. I believe it matches Wikipedia's definition of FVU and R squared.

$$\mathrm{FVU}_B = \frac{1}{N}\sum_{i} \frac{\lVert x_i - \hat{x}_i \rVert^2}{\lVert x_i - \bar{x} \rVert^2}$$

is the formula used by SAELens and SAEBench. It seems less pri... (read more)

1Archimedes
FVU_B doesn't make sense, but I don't see where you're getting FVU_B from. Here's the code I'm seeing:

    resid_sum_of_squares = (
        (flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
    )
    total_sum_of_squares = (
        (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)
    )
    mse = resid_sum_of_squares / flattened_mask.sum()
    explained_variance = 1 - resid_sum_of_squares / total_sum_of_squares

Explained variance = 1 - FVU = 1 - (residual sum of squares) / (total sum of squares)
8notfnofn
I would be very surprised if this FVU_B is actually another definition and not a bug. It's not a fraction of the variance, and those denominators can easily be zero or very near zero.
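A toy illustration of that failure mode (numbers made up):

    import torch

    # One sample lands near the dataset mean, so its per-sample denominator is tiny.
    x = torch.tensor([[1.0, 0.0], [-1.0, 0.0], [0.01, 0.0]])
    x_hat = x + 0.1

    resid = (x - x_hat).pow(2).sum(-1)
    total = (x - x.mean(0)).pow(2).sum(-1)
    print(resid.sum() / total.sum())  # FVU_A: ~0.03
    print((resid / total).mean())     # FVU_B: ~150, dominated by the third sample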
2Gurkenglas
https://github.com/jbloomAus/SAELens/blob/main/sae_lens/evals.py#L511 sums the numerator and denominator separately, if they aren't doing that in some other place probably just file a bug report?

Same plot but using SAEBench's FVU definition. Matches this Neuronpedia page.

I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes:

  1. [Minor] I didn't subtract the mean in the variance calculation. This barely had an effect on the results.
  2. [Major] I used a different definition of "Explained Variance" which caused a pretty large difference

Old (no longer true) text:

It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0&

... (read more)
StefanHex*40

After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code).

These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range.

v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568
v3 (KMeans): Layer blocks.3.hook_resid_post,
... (read more)
StefanHex*20

You're right. I forgot to subtract the mean. Thanks a lot!!

I'm computing new numbers now, but indeed I expect this to explain my result! (Edit: Seems to not change too much)

4StefanHex
After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code). These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range. v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568 v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.9020 / 0.8740 v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8044 / 0.7197 v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.8261 / 0.7509 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4095, n_pca=1, variance explained = 0.8910 / 0.8599 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16383, n_pca=1, variance explained = 0.9041 / 0.8766 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4094, n_pca=2, variance explained = 0.8948 / 0.8647 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16382, n_pca=2, variance explained = 0.9076 / 0.8812 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4091, n_pca=5, variance explained = 0.9044 / 0.8770 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16379, n_pca=5, variance explained = 0.9159 / 0.8919 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4086, n_pca=10, variance explained = 0.9121 / 0.8870 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16374, n_pca=10, variance explained = 0.9232 / 0.9012 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4076, n_pc

I should really run a random Gaussian data baseline for this.

Tentatively I get similar results (70-85% variance explained) for random data -- I haven't checked that code at all though, don't trust this. Will double check this tomorrow.

(In that case SAE's performance would also be unsurprising I suppose)

[This comment is no longer endorsed by its author]

If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space

I think this is what I care about finding out. If you're right this is indeed not surprising nor an issue, but you being right would be a major departure from the current mainstream interpretability paradigm(?).

The question of regions vs compositionality is what I've been investigating with my mentees recently, and pretty keen on. I'll want to write up my current thoughts on this topic sometime soon.

What do you mean you’re encoding/decoding like normal but using the k means vectors?

So I do something like

        latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
        max_latent = latents_tmp.argmax(dim=-1)  # shape: [batch]
        latents = one_hot(max_latent)

where the first line is essentially an SAE embedding (and centroids are the features), and the second/third line is a top-k. And for reconstruction do something like

    recon = latents.float() @ centroids  # [batch, n_clusters] @ [n_clusters, d] -> [batch, d]

which should also be equivalent.
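Putting that together, a self-contained sketch (shapes and variable names are mine, not the actual experiment code):

    import torch
    import torch.nn.functional as F

    batch, d, n_clusters = 128, 64, 512
    data = torch.randn(batch, d)
    centroids = torch.randn(n_clusters, d)  # k-means centroids, used as the dictionary

    # Encode: dot product with every centroid (the "SAE embedding"), then top-1
    latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
    max_latent = latents_tmp.argmax(dim=-1)              # [batch]
    latents = F.one_hot(max_latent, n_clusters).float()  # [batch, n_clusters]

    # Decode: each sample is reconstructed as its assigned centroid
    recon = latents @ centroids                          # [batch, d]
    assert torch.allclose(recon, centroids[max_latent])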

Shouldn’t the SAE training process for a top k

... (read more)
5JoshEngels
I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I get that clustering recovers ~87% fraction of variance explained, while a k = 32 SAE gets more like 95% variance explained. I did the nonlinear version of finding nearest neighbors when using k-means to give k-means the biggest advantage possible, and did k-means clustering on points using the FAISS clustering library. Definitely take this with a grain of salt; I'm going to look through my code and see if I can reproduce your results on Pythia too, and if so try on a larger model too. Code: https://github.com/JoshEngels/CheckClustering/tree/main

I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.

Thanks for pointing this out! I confused nomenclature, will fix!

Edit: Fixed now. I confused

  • the number of clusters ("K") / dictionary size
  • the number of latents ("L_0" or k in top-k SAEs). Some clustering methods allow you to assign multiple clusters to on
... (read more)
StefanHex122

this seems concerning.

I feel like my post appears overly dramatic; I'm not very surprised and don't consider this the strongest evidence against SAEs. It's an experiment I ran a while ago and it hasn't changed my (somewhat SAE-sceptic) stance much.

But this is me having seen a bunch of other weird SAE behaviours (pre-activation distributions are not the way you'd expect from the superposition hypothesis h/t @jake_mendel, if you feed SAE-reconstructed activations back into the encoder the SAE goes nuts, stuff mentioned in recent Apollo papers, ...).


Reasons t... (read more)

2StefanHex
Tentatively I get similar results (70-85% variance explained) for random data -- I haven't checked that code at all though, don't trust this. Will double check this tomorrow. (In that case SAE's performance would also be unsurprising I suppose)
2Alexander Gietelink Oldenziel
Is there a benchmark in which SAEs clearly, definitely outperform standard techniques?
StefanHex*46-1

Edited to fix errors pointed out by @JoshEngels and @Adam Karvonen (mainly: different definition for explained variance, details here).

Summary: K-means explains 72 - 87% of the variance in the activations, comparable to vanilla SAEs but less than better SAEs. I think this (bug-fixed) result is neither evidence in favour of SAEs nor against; the Clustering & SAE numbers make a straight-ish line on a log plot.

Epistemic status: This is a weekend-experiment I ran a while ago and I figured I should write it up to share. I have taken decent care to check my ... (read more)

2StefanHex
Same plot but using SAEBench's FVU definition. Matches this Neuronpedia page.
2StefanHex
I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes: 1. [Minor] I didn't subtract the mean in the variance calculation. This barely had an effect on the results. 2. [Major] I used a different definition of "Explained Variance" which caused a pretty large difference Old (no longer true) text:
1Andrew Mack
I think the relation between K-means and sparse dictionary learning (essentially K-means is equivalent to an L_0=1 constraint) is already well-known in the sparse coding literature? For example see this wiki article on K-SVD (a sparse dictionary learning algorithm) which first reviews this connection before getting into the nuances of k-SVD. Were the SAEs for this comparison trained on multiple passes through the data, or just one pass/epoch? Because if for K-means you did multiple passes through the data but for SAEs just one then this feels like an unfair comparison.

I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly. 

https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309

I think FVU is correctly computed by subtracting the mean from each dimension when computing the denominator. See the SAEBench impl here:

https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
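In code, the difference is roughly this (a sketch with made-up tensors, not the actual SAEBench code):

    import torch

    acts = torch.randn(1000, 64) + 3.0  # activations with a non-zero mean
    recon = acts + 0.1 * torch.randn_like(acts)

    resid = (acts - recon).pow(2).sum()
    fvu_uncentered = resid / acts.pow(2).sum()            # no mean subtraction
    fvu = resid / (acts - acts.mean(dim=0)).pow(2).sum()  # per-dimension mean subtracted
    print(fvu_uncentered.item(), fvu.item())  # the uncentered denominator makes FVU look much smaller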

When I used yo... (read more)

4tailcalled
I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance. But anyway, under my current model (roughly Why I'm bearish on mechanistic interpretability: the shards are not in the network + Binary encoding as a simple explicit construction for superposition) it seems about as natural to use K-means as it does to use SAEs, and not necessarily an issue if K-means outperforms SAEs. If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space, then K-means seems like a perfectly cromulent quantization for identifying these volumes. The major issue is where we go from here.
1JoshEngels
What do you mean you’re encoding/decoding like normal but using the k means vectors? Shouldn’t the SAE training process for a top k SAE with k = 1 find these vectors then?  In general I’m a bit skeptical that clustering will work as well on larger models, my impression is that most small models have pretty token level features which might be pretty clusterable with k=1, but for larger models many activations may belong to multiple “clusters”, which you need dictionary learning for. 
8Alexander Gietelink Oldenziel
this seems concerning. Can somebody ELI5 what's going on here?
StefanHex1513

I’ve just read the article, and found it indeed very thought provoking, and I will be thinking more about it in the days to come.

One thing though I kept thinking: Why doesn’t the article mention AI Safety research much?

In the passage

The only policy that AI Doomers mostly agree on is that AI development should be slowed down somehow, in order to “buy time.”

I was thinking: surely most people would agree on policies like “Do more research into AI alignment” / “Spend more money on AI Notkilleveryoneism research”?

In general the article frames the policy to ... (read more)

6Davidmanheim
Because almost all of current AI safety research can't make future agentic ASI that isn't already aligned with human values safe, as everyone who has looked at the problem seems to agree. And the Doomers certainly have been clear about this, even as most of the funding goes to prosaic alignment.

Thanks for writing these up! I liked that you showed equivalent examples in different libraries, and included the plain “from scratch” version.

Hmm, I think I don't fully understand your post. Let me summarize what I get, and what is confusing me:

  • I absolutely get the "there are different levels / scales of explaining a network" point
  • It also makes sense to tie this to some level of loss. E.g. explain GPT-2 to a loss level of L=3.0 (rather than L=2.9), or explain IOI with 95% accuracy.
  • I'm also a fan of expressing losses in terms of compute or model size ("SAE on Claude 5 recovers Claude 2-levels of performance").

I'm confused whether your post tries to tell us (how to determine) what loss our interpr... (read more)

2Dmitry Vaintrob
Thanks for the questions!  Sorry, I think the context of the Watanabe scale is a bit confusing. I'm saying that in fact it's the wrong scale to use as a "natural scale". The Watanabe scale depends only on the number of training datapoints, and doesn't notice any other properties of your NN or your phenomenon of interest.  Roughly, the Watanabe scale is the scale on which loss improves if you memorize a single datapoint (so memorizing improves accuracy by 1/n with n = #(training set) and in a suitable operationalization, improves loss by O(logn/n), and this is the Watanabe scale).  It's used in SLT roughly because it's the minimal temperature scale where "memorization doesn't count as relevant", and so relevant measurements become independent of the n-point sample. However in most interp experiments, the realistic loss reconstruction loss reconstruction is much rougher (i.e., further from optimal loss) than the 1/n scale where memorization becomes an issue (even if you conceptualize #(training set) as some small synthetic training set that you were running the experiment on). For your second question: again, what I wrote is confusing and I really want to rewrite it more clearly later. I tried to clarify what I think you're asking about in this shortform. Roughly, the point here is that to avoid having your results messed up by spurious behaviors, you might want to degrade as much as possible while still observing the effect of your experiment. The idea is that if you found any degradation that wasn't explicitly designed with your experiment in mind (i.e., is natural), but where you see your experimental results hold, then you have "found a phenomenon". The hope is that if you look at the roughest such scale, you might kill enough confounders and interactions to make your result be "clean" (or at least cleaner): so for example optimistically you might hope to explain all the loss of the degraded model at the degradation scale you chose (whereas at other scales, th

Great read! I think you explained well the intuition why logits / logprobs are so natural (I haven't managed to do this well in a past attempt). I like the suggestion that (a) NNs consist of parallel mechanisms to get the answer, and (b) the best way to combine multiple predictions is via adding logprobs.

I haven't grokked your loss scales explanation (the "interpretability insights" section) without reading your other post though.

2Dmitry Vaintrob
Thanks! Not saying anything deep here. The point is just that you might have two cartoon pictures: 1. every correctly classified input is either the result of a memorizing circuit or of a single coherent generalizing circuit behavior. If you remove a single generalizing circuit, your accuracy will degrade additively. 2. a correctly classified input is the result of a "combined" circuit consisting of multiple parallel generalizing "subprocesses" giving independent predictions, and if you remove any of these subprocesses, your accuracy will degrade multiplicatively. A lot of ML work only thinks about picture #1 (which is the natural picture to look at if you only have one generalizing circuit and every other circuit is a memorization). But the thing I'm saying is that picture #2 also occurs, and in some sense is "the info-theoretic default" (though both occur simultaneously -- this is also related to the ideas in this post)  

Keen on reading those write-ups, I appreciate the commitment!

Simultaneously, as they lead to separate paths, both of which are needed as inputs for the final node.

StefanHex*91

List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

What is going on with activation plateaus: Transformer activations space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem... (read more)
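A minimal sketch of the kind of experiment I have in mind (model, layer, and prompt are placeholders, not a worked-out setup):

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
    layer, pos = 6, -1
    hook_name = f"blocks.{layer}.hook_resid_post"

    base_logits, cache = model.run_with_cache(tokens)
    base_logprobs = base_logits[0, -1].log_softmax(-1)
    base_act = cache[hook_name][0, pos].clone()

    direction = torch.randn_like(base_act)
    direction = direction / direction.norm()

    # Patch in perturbed activations; a "plateau" shows up as near-zero KL
    # until the perturbation crosses some threshold scale.
    for scale in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
        def patch(act, hook, scale=scale):
            act[0, pos] = base_act + scale * direction
            return act

        logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, patch)])
        kl = torch.nn.functional.kl_div(
            logits[0, -1].log_softmax(-1), base_logprobs, log_target=True, reduction="sum"
        )
        print(f"scale={scale:4.1f}  KL(base || patched)={kl.item():.4f}")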

StefanHex*60

List of some medium-sized mech interp project ideas (see also: shorter and longer ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

Toy model of Computation in Superposition: The toy model of computation in superposition (CIS; Circuits-in-Sup, Comp-in-Sup post / paper) describes a way in which NNs could perform computation in superposition, rather than just storing information in superposition (TMS). It would be good to have some actually trai... (read more)

StefanHex*40

List of some short mech interp project ideas (see also: medium-sized and longer ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

Directly testing the linear representation hypothesis by making up a couple of prompts which contain a few concepts to various degrees and test

  • Does the model indeed represent intensity as magnitude? Or are there separate features for separately intense versions of a concept? Finding the right prompts is tricky, e.g.
... (read more)

CLDR (Cross-layer distributed representation): I don't think Lee has written his up anywhere yet so I've removed this for now.

Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.

Thanks for the flag! It's these two images, I realize now that they don't seem to have direct links

Images taken from AMFTC and Crosscoders by Anthropic.

Thanks for the comment!

I think this is what most mech interp researchers more or less think. That said, I definitely expect many researchers would disagree with individual points, and the list doesn't fairly weigh all views and aspects (it's very biased towards "people I talk to"). (Also, this is in no way an Apollo / Apollo interp team statement, just my personal view.)

Thanks! You're right, I totally mixed up local and dense/distributed codes. I decided to just leave out that terminology.

StefanHex114

Why I'm not too worried about architecture-dependent mech interp methods:

I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture. While this is certainly a nice-to-have, and maybe a sign that a method is principled, I don't think this criterion itself is important.

I think that the biggest hurdle for interpretability is to understand any AI that produces advanced language (>=GPT2 level). We don't know how to write a non-ML program that speaks English, let alone reason, and we have no ide... (read more)

5Lucius Bushnaq
Agreed. I do value methods being architecture independent, but mostly just because of this:  At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.
3bilalchughtai
Agreed. A related thought is that we might only need to be able to interpret a single model at a particular capability level to unlock the safety benefits, as long as we can make a sufficient case that we should use that model. We don't care inherently about interpreting GPT-4, we care about there existing a GPT-4 level model that we can interpret.
4Jozdien
I think the usual reason this claim is made is that the person making the claim thinks it's very plausible LLMs aren't the paradigm that leads to AGI. If that's the case, then interpretability that's indexed heavily on them gets us understanding of something qualitatively weaker than we'd like. I agree that there'll be some transfer, but it seems better and not-very-hard to talk about how well different kinds of work transfer.

Why I'm not that hopeful about mech interp on TinyStories models:

Some of the TinyStories models are open source, and manage to output sensible language while being tiny (say 64dim embedding, 8 layers). Maybe it'd be great to try and thoroughly understand one of those?

I am worried that those models simply implement a bunch of bigrams and trigrams, and that all their performance can be explained by boring statistics & heuristics. Thus we would not learn much from fully understanding such a model. Evidence for this is that the 1-layer variant, which due t... (read more)

StefanHex*40-1

Collection of some mech interp knowledge about transformers:

Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.

Transformers take in a sequence of t... (read more)

3Rauno Arike
This is a nice overview, thanks! I don't think I've seen the CLDR acronym before, are the arguments publicly written up somewhere? Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.
3aribrill
Thanks for the great writeup. Typo: I think you meant to write distributed, not local, codes. A local code is the opposite of superposition.
3[anonymous]
Who is "we"? Is it: 1. only you and your team? 2. the entire Apollo Research org? 3. the majority of mechinterp researchers worldwide? 4. some other group/category of people? Also, this definitely deserves to be made into a high-level post, if you end up finding the time/energy/interest in making one.
2Matt Goldenberg
this is great, thanks for sharing

Thanks for the nice writeup! I'm confused about why you can get away without interpretation of what the model components are:

In cases where we worry that our model learned a human-simulator / camera-simulator rather than actually predicting whether the diamond exists, wouldn't circuit discovery simply give us the human-simulator circuit? (And thus causal scrubbing doesn't save us.) I'm thinking in particular of cases where the human-simulator is easier to learn than the intended solution.

Of course if you had good interpretability, a way to realise whether ... (read more)

2Erik Jenner
You're totally right that this is an important difficulty I glossed over, thanks! TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming). ---------------------------------------- Just to restate things a bit, I'd distinguish two cases: * "In-distribution anomaly detection:" we are fine with flagging any input as "anomalous" that's OOD compared to the trusted distribution * "Off-distribution anomaly detection:" there are some inputs that are OOD but that we still want to classify as "normal" In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we're fine with flagging everything that's OOD. But  we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exlusion finetuning (appendix I in Redwood's measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training. I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they're just "use a measurement tampering detector to bootstrap to

Paper link: https://arxiv.org/abs/2407.20311

(I have neither watched the video nor read the paper yet, just in case someone else was looking for the non-video version)

[…] no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

I find myself really confused by this argument. Shards (or anything) do not need to be “concentrated in one spot” for studying them to make sense?

As Neel and Lucius say, you might study SAE latents or abstractions built on the weights; no one requires (or assumes) that things are concentrated in one spot.

Or to make another analogy, one can study neuroscience even though things are not concentrat... (read more)

2tailcalled
I don't doubt you can find many facts about SAE latents, I just don't think they will be relevant for anything that matters. I'm by-default bearish on neuroscience too, though it's more nuanced there. Feeding the output into the input isn't much thinking. It just allows the thinking to occur in a very diffuse way.

Even after reading this (2 weeks ago), today I couldn't manage to find the comment link and ended up manually scrolling down. I later noticed it (at the bottom left), but it's so far away from everything else. I think putting it somewhere at the top, near the rest of the UI, would make it much easier to find.

4habryka
Yeah, we'll probably make that adjustment soon. I also currently think the comment link is too hidden, even after trying to get used to it for a while.

I would like the following subscription: All posts with certain tags, e.g. all [AI] posts or all [Interpretability (ML & AI)] posts.

I just noticed (and enabled) a “subscribe” feature in the page for the tag, it says “Get notifications when posts are added to this tag.” — I’m unsure if those are emails, but assuming they are, my problem is solved. I never noticed this option before.

2Raemon
I think by default they are site-notifications, but in your user settings you can change them to emails.
StefanHex*Ω120

And here's the code to do it with replacing the LayerNorms with identities completely:

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_
... (read more)
3Quiche Eater
You should also set model.cfg.normalization_type = None afterwards. It's mostly a formality since you're doing it after initialization. ActivationCache.apply_ln_to_stack() is the only function I found which behaves incorrectly if you don't change this.
2Logan Riggs
And here's the code to convert it to NNsight (Thanks Caden for writing this awhile ago!) import torch from transformers import GPT2LMHeadModel from transformer_lens import HookedTransformer from nnsight.models.UnifiedTransformer import UnifiedTransformer model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu") # Undo my hacky LayerNorm removal for block in model.transformer.h: block.ln_1.weight.data = block.ln_1.weight.data / 1e6 block.ln_1.eps = 1e-5 block.ln_2.weight.data = block.ln_2.weight.data / 1e6 block.ln_2.eps = 1e-5 model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6 model.transformer.ln_f.eps = 1e-5 # Properly replace LayerNorms by Identities def removeLN(transformer_lens_model): for i in range(len(transformer_lens_model.blocks)): transformer_lens_model.blocks[i].ln1 = torch.nn.Identity() transformer_lens_model.blocks[i].ln2 = torch.nn.Identity() transformer_lens_model.ln_final = torch.nn.Identity() hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") removeLN(hooked_model) model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") removeLN(model_nnsight) device = torch.device("cuda" if torch.cuda.is_available() else "cpu") prompt = torch.tensor([1,2,3,4], device=device) logits = hooked_model(prompt) with torch.no_grad(), model_nnsight.trace(prompt) as runner: logits2 = model_nnsight.unembed.output.save() logits, cache = hooked_model.run_with_cache(prompt) torch.allclose(logits, logits2)
StefanHexΩ260

Here's a quick snippet to load the model into TransformerLens!

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=False, center_unembed=False).to("cpu")
# Kill the LayerNorms because TransformerLens overwrites eps
for block in hooked_model.blocks:
    block.ln1.eps = 1e12
    block.ln2.eps = 1e12
hooked_model.ln_final.eps = 1e12

# Make sure the outp
... (read more)
2StefanHex
And here's the code to do it with replacing the LayerNorms with identities completely: import torch from transformers import GPT2LMHeadModel from transformer_lens import HookedTransformer model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu") # Undo my hacky LayerNorm removal for block in model.transformer.h: block.ln_1.weight.data = block.ln_1.weight.data / 1e6 block.ln_1.eps = 1e-5 block.ln_2.weight.data = block.ln_2.weight.data / 1e6 block.ln_2.eps = 1e-5 model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6 model.transformer.ln_f.eps = 1e-5 # Properly replace LayerNorms by Identities class HookedTransformerNoLN(HookedTransformer): def removeLN(self): for i in range(len(self.blocks)): self.blocks[i].ln1 = torch.nn.Identity() self.blocks[i].ln2 = torch.nn.Identity() self.ln_final = torch.nn.Identity() hooked_model = HookedTransformerNoLN.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") hooked_model.removeLN() hooked_model.cfg.normalization_type = None prompt = torch.tensor([1,2,3,4], device="cpu") logits = hooked_model(prompt) print(logits.shape) print(logits[0, 0, :10])

I really like the investigation into properties of SAE features, especially the angle of testing whether SAE features have particular properties than other (random) directions don't have!

Random directions as a baseline: Based on my experience here I expect random directions to be a weak baseline. For example the covariance matrix of model activations (or SAE features) is very non-uniform. I'd second @Hoagy's suggestion of linear combination of SAE features, or direction towards other model activations as I used here.
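For reference, a quick sketch of what I mean by a covariance-matched random baseline (the tensors here are stand-ins, not real activations):

    import torch

    acts = torch.randn(10_000, 64) @ torch.randn(64, 64)  # stand-in for [n_samples, d_model] activations
    cov = torch.cov(acts.T)                                # [d_model, d_model]
    cov = cov + 1e-6 * torch.eye(cov.shape[0])             # small jitter for numerical stability

    dist = torch.distributions.MultivariateNormal(torch.zeros(cov.shape[0]), covariance_matrix=cov)
    random_dirs = dist.sample((100,))                      # [100, d_model]
    random_dirs = random_dirs / random_dirs.norm(dim=-1, keepdim=True)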

Ablation vs functional FT-LLC: I found t... (read more)

I like this idea! I'd love to see checks of this on the SOTA models which tend to have lots of layers (thanks @Joseph Miller for running the GPT2 experiment already!).

I notice this line of argument would also imply that the embedding information can only be accessed up to a certain layer, after which it will be washed out by the high-norm outputs of layers. (And the same for early MLP layers which are rumoured to act as extended embeddings in some models.) -- this seems unexpected.

Additionally, they would be further evidence (but not conclusive[2]) towards

... (read more)

I know he’s legitimately affiliated with that YT channel

Can I ask how you know that? The amount of "w Stephen Fry" video titles made me suspicious, and I wondered whether it's AI generated and not Stephen-Fry-endorsed, but I haven't done any further research.

Edit: A colleague just pointed out that other videos are up to 7 years old (and AI voice wasn't this good then), so in that video the voice must be real

2TeaTieAndHat
Apparently, he co-founded the channel. But of course he might have had his voiced faked just for this video, as some suggested in the comments to it.

Has anyone tested whether feature splitting can be explained by composite (non-atomic) features?

  • Feature splitting is the observation that SAEs with larger dictionary size find features that are geometrically (cosine similarity) and semantically (activating dataset examples) similar. In particular, a larger SAE might find multiple features that are all similar to each other, and to a single feature found in a smaller SAE.
    • Anthropic gives the example of the feature " 'the' in mathematical prose" which splits into features " 'the' in mathematics
... (read more)
1RGRGRG
I like this recent post about atomic meta-SAE features, I think these are much closer (compared against normal SAEs) to what I expect atomic units to look like: https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much.

In unrelated experiments I found that steering into a (uniform) random direction is much less effective, than steering into a random direction sampled with same covariance as the real activations. This suggests that there might be a lot of directions[1] that don't influence the output of the network much. This was on GPT2 but I'd expect it to generalize for other Transformers.

  1. ^

    Though I don't know

... (read more)
StefanHex112

Hmm, with that we'd need  to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 -> 20) as an indication of how much the cosine similarity changes, then this is consistent with  for the original vector. This seems plausible for a steering vector?

  1. ^

    Thanks to @Lucius Bushnaq for correcting my earlier wrong number

That model has an Attention and MLP block (GPT2-style model with 1 layer but a bit wider, 21M params).

I changed my mind over the course of this morning. The TinyStories models' language isn't that bad, and I think it'd be a decent research project to try to fully understand one of these.

I've been playing around with the models this morning, quotes from the 1-layer model:

Once upon a time, there was a lovely girl called Chloe. She loved to go for a walk every morning and one day she came across a road.

One day, she decided she wanted to go for a ride. She jump

... (read more)
2RogerDearnaley
Yup: the 1L model samples are full of non-sequiturs, to the level I can't imagine a human child telling a story that badly; whereas the first 2L model example has maybe one non-sequitur/plot jump (the way the story ignores the content of bird's first line of dialog), which the rest of the story then works into, so it ends up almost making sense, in retrospect (except it would have made better sense if the bear had said that line). The second example has a few non-sequiturs, but they're again not glaring and continuous the way the 1L output is. (As a parent) I can imagine a rather small human child telling a story with about the 2L level of plot inconsistencies.

The TinyStories setup seems quite simple, in the sense that I can see how you could achieve TinyStories levels of loss by following simple rules plus a bunch of memorization.

Empirically, one of the best models in the TinyStories paper is a super-wide 1L transformer, which is basically bigrams, trigrams, and slightly more complicated variants [see Buck's post], but nothing that requires a step of reasoning.

I am actually quite uncertain where the significant gap between TinyStories, GPT-2 and GPT-4 is. Maybe I could fully understand TinyStories-1L if I tried, would this tell us about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.

2RogerDearnaley
From rereading the Tiny Stories paper, the 1L model did a really bad job of maintaining the internal consistency of the story and figuring out and allowing for the logical consequences of events, but otherwise did a passably good job of speaking coherent childish English. So the choice on transformer block count would depend on how interested you are in learning how to speak English that is coherent as well as grammatical. Personally I'd probably want to look at something in the 3–4-layer range, so it has an input layer, and output layer, and at least one middle layer, and might actually contain some small circuits. I would LOVE to have an automated way of converting a Tiny Stories-size transformer to some form of declarative language spaghetti code. It would probably help to start with a heavily-quantized version. For example, a model trained using the techniques of the recent paper on building AI using trinary logic (so roughly a 1.6-bit quantization, and eliminating matrix multiplication entirely) might be a good place to start, combined with the sort of techniques the model-pruning folks have been working on for which model-internal interactions are important on the training set and which are just noise and can be discarded. I strongly suspect that every transformer model is just a vast pile of heuristics. In certain cases, if trained on a situation that genuinely is simple and has a specific algorithm to solve it runnable during a model forward-pass (like modular arithmetic, for example), and with enough data to grok it, then the resulting heuristic may actually be an elegant True Name algorithm for the problem. Otherwise, it's just going to be a pile of heuristics that SGD found and tuned. Fortunately SGD (for reasons that singular learning theory illuminates) has a simplicity bias that gives a prior that acts like Occam's Razor or a Kolmogorov Complexity prior, so tends to prefer algorithms that generalize well (especially as the amount of data tends to inf
3jow
Is that TinyStories model a super-wide attention-only transformer (the topic of the mechanistic interp work and Buck's post you cite)? I tried to figure it out briefly and couldn't tell, but I bet it isn't, and instead has extra stuff like an MLP block. Regardless, in my view it would be a big advance to really understand how the TinyStories models work. Maybe they are "a bunch of heuristics" but maybe that's all GPT-4, and our own minds, are as well…

Thanks for the comment Lawrence, I appreciate it!

  • I agree this doesn't distinguish superposition vs no superposition at all; I was more thinking about the "error correction" aspect of MCIS (and just assuming superposition to be true). But I'm excited too for the SAE application, we got some experiments in the pipeline!
  • Your Correct behaviour point sounds reasonable but I feel like it's not an explanation? I would have the same intuitive expectation, but that doesn't explain how the model manages to not be sensitive. Explanations I can think of in increasing
... (read more)

My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.

  • This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE.
  • What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and tha
... (read more)

There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour.

But often we make the claim that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.

Are the features learned by the model the same as the features learned by SAEs?

TL;DR: I want model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” have it. I want to know whether SAE-features have this property too, or if SAE-features do not match the model-features.

Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in... (read more)

1StefanHex
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset. * This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE. * What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won't help us understand models 1. ^ Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
1StefanHex
There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour. But often we make the claim that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.