All of StefanHex's Comments + Replies

The previous lines calculate the ratio (or 1 - ratio) stored in the “explained variance” key for every sample/batch. Then, in that later quoted line, the list is averaged, i.e. we're taking the sample average over the ratio. That's the FVU_B formula.

Let me know if this clears it up or if we’re misunderstanding each other!

I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later, in this line, which happens after the division:

        metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()
1Archimedes
Let's suppose that's the case. I'm still not clear on how are you getting to FVU_B?

I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later, in this line, which happens after the division:

        metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()

Edit: And to clarify, my impression is that people think of this as alternative definitions of FVU and you got to pick one, rather than one being right and one being a bug.

Edit2: And I'm in touch with the SAEBench authors about making a PR to change this / add both options (and by extension probably doing the same in S... (read more)

4Gurkenglas
Ah, oops. I think I got confused by the absence of L_2 syntax in your formula for FVU_B. (I agree that FVU_A is more principled ^^.)
StefanHex*272

PSA: People use different definitions of "explained variance" / "fraction of variance unexplained" (FVU)

$$\mathrm{FVU}_A = \frac{\sum_{i} \lVert x_i - \hat{x}_i \rVert^2}{\sum_{i} \lVert x_i - \bar{x} \rVert^2}$$

is the formula I think is sensible (where $x_i$ are the original activations, $\hat{x}_i$ the reconstructions, and $\bar{x}$ the mean activation); the bottom is simply the variance of the data, and the top is the variance of the residuals. The $\lVert\cdot\rVert$ indicates the $L_2$ norm over the dimension of the vector $x_i$. I believe it matches Wikipedia's definition of FVU and R squared.

$$\mathrm{FVU}_B = \frac{1}{N}\sum_{i} \frac{\lVert x_i - \hat{x}_i \rVert^2}{\lVert x_i - \bar{x} \rVert^2}$$

is the formula used by SAELens and SAEBench. It seems less pri... (read more)

1Archimedes
FVU_B doesn't make sense, but I don't see where you're getting FVU_B from. Here's the code I'm seeing:

    resid_sum_of_squares = (
        (flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
    )
    total_sum_of_squares = (
        (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)
    )
    mse = resid_sum_of_squares / flattened_mask.sum()
    explained_variance = 1 - resid_sum_of_squares / total_sum_of_squares

Explained variance = 1 - FVU = 1 - (residual sum of squares) / (total sum of squares)
8notfnofn
I would be very surprised if this FVU_B is actually another definition and not a bug. It's not a fraction of the variance, and those denominators can easily be zero or very near zero.
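A toy illustration of that failure mode (numbers made up):

    import torch

    # One sample lands near the dataset mean, so its per-sample denominator is tiny.
    x = torch.tensor([[1.0, 0.0], [-1.0, 0.0], [0.01, 0.0]])
    x_hat = x + 0.1

    resid = (x - x_hat).pow(2).sum(-1)
    total = (x - x.mean(0)).pow(2).sum(-1)
    print(resid.sum() / total.sum())  # FVU_A: ~0.03
    print((resid / total).mean())     # FVU_B: ~150, dominated by the third sample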
2Gurkenglas
https://github.com/jbloomAus/SAELens/blob/main/sae_lens/evals.py#L511 sums the numerator and denominator separately, if they aren't doing that in some other place probably just file a bug report?

Same plot but using SAEBench's FVU definition. Matches this Neuronpedia page.

I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes:

  1. [Minor] I didn't subtract the mean in the variance calculation. This barely had an effect on the results.
  2. [Major] I used a different definition of "Explained Variance" which caused a pretty large difference

Old (no longer true) text:

It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0&

... (read more)
StefanHex*40

After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code).

These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range.

v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568
v3 (KMeans): Layer blocks.3.hook_resid_post,
... (read more)
StefanHex*20

You're right. I forgot to subtract the mean. Thanks a lot!!

I'm computing new numbers now, but indeed I expect this to explain my result! (Edit: Seems to not change too much)

4StefanHex
After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code). These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range. v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568 v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.9020 / 0.8740 v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8044 / 0.7197 v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.8261 / 0.7509 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4095, n_pca=1, variance explained = 0.8910 / 0.8599 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16383, n_pca=1, variance explained = 0.9041 / 0.8766 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4094, n_pca=2, variance explained = 0.8948 / 0.8647 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16382, n_pca=2, variance explained = 0.9076 / 0.8812 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4091, n_pca=5, variance explained = 0.9044 / 0.8770 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16379, n_pca=5, variance explained = 0.9159 / 0.8919 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4086, n_pca=10, variance explained = 0.9121 / 0.8870 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16374, n_pca=10, variance explained = 0.9232 / 0.9012 PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4076, n_pc

I should really run a random Gaussian data baseline for this.

Tentatively I get similar results (70-85% variance explained) for random data -- I haven't checked that code at all though, don't trust this. Will double check this tomorrow.

(In that case SAE's performance would also be unsurprising I suppose)

[This comment is no longer endorsed by its author]

If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space

I think this is what I care about finding out. If you're right this is indeed not surprising nor an issue, but you being right would be a major departure from the current mainstream interpretability paradigm(?).

The question of regions vs compositionality is what I've been investigating with my mentees recently, and pretty keen on. I'll want to write up my current thoughts on this topic sometime soon.

What do you mean you’re encoding/decoding like normal but using the k means vectors?

So I do something like

        latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
        max_latent = latents_tmp.argmax(dim=-1)  # shape: [batch]
        latents = one_hot(max_latent)

where the first line is essentially an SAE embedding (and centroids are the features), and the second/third line is a top-k. And for reconstruction do something like

    recon = latents.float() @ centroids  # [batch, n_clusters] @ [n_clusters, d] -> [batch, d]

which should also be equivalent.
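Putting that together, a self-contained sketch (shapes and variable names are mine, not the actual experiment code):

    import torch
    import torch.nn.functional as F

    batch, d, n_clusters = 128, 64, 512
    data = torch.randn(batch, d)
    centroids = torch.randn(n_clusters, d)  # k-means centroids, used as the dictionary

    # Encode: dot product with every centroid (the "SAE embedding"), then top-1
    latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
    max_latent = latents_tmp.argmax(dim=-1)              # [batch]
    latents = F.one_hot(max_latent, n_clusters).float()  # [batch, n_clusters]

    # Decode: each sample is reconstructed as its assigned centroid
    recon = latents @ centroids                          # [batch, d]
    assert torch.allclose(recon, centroids[max_latent])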

Shouldn’t the SAE training process for a top k

... (read more)
5JoshEngels
I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I get that clustering recovers ~87% fraction of variance explained, while a k = 32 SAE gets more like 95% variance explained. I did the nonlinear version of finding nearest neighbors when using k-means to give k-means the biggest advantage possible, and did k-means clustering on points using the FAISS clustering library. Definitely take this with a grain of salt; I'm going to look through my code and see if I can reproduce your results on Pythia too, and if so try on a larger model too. Code: https://github.com/JoshEngels/CheckClustering/tree/main

I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.

Thanks for pointing this out! I confused nomenclature, will fix!

Edit: Fixed now. I confused

  • the number of clusters ("K") / dictionary size
  • the number of latents ("L_0" or k in top-k SAEs). Some clustering methods allow you to assign multiple clusters to on
... (read more)
StefanHex122

this seems concerning.

I feel like my post appears overly dramatic; I'm not very surprised and don't consider this the strongest evidence against SAEs. It's an experiment I ran a while ago and it hasn't changed my (somewhat SAE-sceptic) stance much.

But this is me having seen a bunch of other weird SAE behaviours (pre-activation distributions are not the way you'd expect from the superposition hypothesis h/t @jake_mendel, if you feed SAE-reconstructed activations back into the encoder the SAE goes nuts, stuff mentioned in recent Apollo papers, ...).


Reasons t... (read more)

2StefanHex
Tentatively I get similar results (70-85% variance explained) for random data -- I haven't checked that code at all though, don't trust this. Will double check this tomorrow. (In that case SAE's performance would also be unsurprising I suppose)
2Alexander Gietelink Oldenziel
Is there a benchmark in which SAEs clearly, definitely outperform standard techniques?
StefanHex*46-1

Edited to fix errors pointed out by @JoshEngels and @Adam Karvonen (mainly: different definition for explained variance, details here).

Summary: K-means explains 72 - 87% of the variance in the activations, comparable to vanilla SAEs but less than better SAEs. I think this (bug-fixed) result is neither evidence in favour of SAEs nor against; the Clustering & SAE numbers make a straight-ish line on a log plot.

Epistemic status: This is a weekend-experiment I ran a while ago and I figured I should write it up to share. I have taken decent care to check my ... (read more)

2StefanHex
Same plot but using SAEBench's FVU definition. Matches this Neuronpedia page.
2StefanHex
I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes: 1. [Minor] I didn't subtract the mean in the variance calculation. This barely had an effect on the results. 2. [Major] I used a different definition of "Explained Variance" which caused a pretty large difference Old (no longer true) text:
1Andrew Mack
I think the relation between K-means and sparse dictionary learning (essentially K-means is equivalent to an L_0=1 constraint) is already well-known in the sparse coding literature? For example see this wiki article on K-SVD (a sparse dictionary learning algorithm) which first reviews this connection before getting into the nuances of k-SVD. Were the SAEs for this comparison trained on multiple passes through the data, or just one pass/epoch? Because if for K-means you did multiple passes through the data but for SAEs just one then this feels like an unfair comparison.

I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly. 

https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309

I think FVU is correctly computed by subtracting the mean from each dimension when computing the denominator. See the SAEBench impl here:

https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
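In code, the difference is roughly this (a sketch with made-up tensors, not the actual SAEBench code):

    import torch

    acts = torch.randn(1000, 64) + 3.0  # activations with a non-zero mean
    recon = acts + 0.1 * torch.randn_like(acts)

    resid = (acts - recon).pow(2).sum()
    fvu_uncentered = resid / acts.pow(2).sum()            # no mean subtraction
    fvu = resid / (acts - acts.mean(dim=0)).pow(2).sum()  # per-dimension mean subtracted
    print(fvu_uncentered.item(), fvu.item())  # the uncentered denominator makes FVU look much smaller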

When I used yo... (read more)

4tailcalled
I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance. But anyway, under my current model (roughly Why I'm bearish on mechanistic interpretability: the shards are not in the network + Binary encoding as a simple explicit construction for superposition) it seems about as natural to use K-means as it does to use SAEs, and not necessarily an issue if K-means outperforms SAEs. If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space, then K-means seems like a perfectly cromulent quantization for identifying these volumes. The major issue is where we go from here.
1JoshEngels
What do you mean you’re encoding/decoding like normal but using the k means vectors? Shouldn’t the SAE training process for a top k SAE with k = 1 find these vectors then?  In general I’m a bit skeptical that clustering will work as well on larger models, my impression is that most small models have pretty token level features which might be pretty clusterable with k=1, but for larger models many activations may belong to multiple “clusters”, which you need dictionary learning for. 
8Alexander Gietelink Oldenziel
this seems concerning. Can somebody ELI5 what's going on here?
StefanHex1513

I’ve just read the article, and found it indeed very thought provoking, and I will be thinking more about it in the days to come.

One thing though I kept thinking: Why doesn’t the article mention AI Safety research much?

In the passage

The only policy that AI Doomers mostly agree on is that AI development should be slowed down somehow, in order to “buy time.”

I was thinking: surely most people would agree on policies like “Do more research into AI alignment” / “Spend more money on AI Notkilleveryoneism research”?

In general the article frames the policy to ... (read more)

6Davidmanheim
Because almost all of current AI safety research can't make future agentic ASI that isn't already aligned with human values safe, as everyone who has looked at the problem seems to agree. And the Doomers certainly have been clear about this, even as most of the funding goes to prosaic alignment.

Thanks for writing these up! I liked that you showed equivalent examples in different libraries, and included the plain “from scratch” version.

Hmm, I think I don't fully understand your post. Let me summarize what I get, and what is confusing me:

  • I absolutely get the "there are different levels / scales of explaining a network" point
  • It also makes sense to tie this to some level of loss. E.g. explain GPT-2 to a loss level of L=3.0 (rather than L=2.9), or explain IOI with 95% accuracy.
  • I'm also a fan of expressing losses in terms of compute or model size ("SAE on Claude 5 recovers Claude 2-levels of performance").

I'm confused whether your post tries to tell us (how to determine) what loss our interpr... (read more)

2Dmitry Vaintrob
Thanks for the questions!  Sorry, I think the context of the Watanabe scale is a bit confusing. I'm saying that in fact it's the wrong scale to use as a "natural scale". The Watanabe scale depends only on the number of training datapoints, and doesn't notice any other properties of your NN or your phenomenon of interest.  Roughly, the Watanabe scale is the scale on which loss improves if you memorize a single datapoint (so memorizing improves accuracy by 1/n with n = #(training set) and in a suitable operationalization, improves loss by O(logn/n), and this is the Watanabe scale).  It's used in SLT roughly because it's the minimal temperature scale where "memorization doesn't count as relevant", and so relevant measurements become independent of the n-point sample. However in most interp experiments, the realistic loss reconstruction loss reconstruction is much rougher (i.e., further from optimal loss) than the 1/n scale where memorization becomes an issue (even if you conceptualize #(training set) as some small synthetic training set that you were running the experiment on). For your second question: again, what I wrote is confusing and I really want to rewrite it more clearly later. I tried to clarify what I think you're asking about in this shortform. Roughly, the point here is that to avoid having your results messed up by spurious behaviors, you might want to degrade as much as possible while still observing the effect of your experiment. The idea is that if you found any degradation that wasn't explicitly designed with your experiment in mind (i.e., is natural), but where you see your experimental results hold, then you have "found a phenomenon". The hope is that if you look at the roughest such scale, you might kill enough confounders and interactions to make your result be "clean" (or at least cleaner): so for example optimistically you might hope to explain all the loss of the degraded model at the degradation scale you chose (whereas at other scales, th

Great read! I think you explained well the intuition why logits / logprobs are so natural (I haven't managed to do this well in a past attempt). I like the suggestion that (a) NNs consist of parallel mechanisms to get the answer, and (b) the best way to combine multiple predictions is via adding logprobs.

I haven't grokked your loss scales explanation (the "interpretability insights" section) without reading your other post though.

2Dmitry Vaintrob
Thanks! Not saying anything deep here. The point is just that you might have two cartoon pictures: 1. every correctly classified input is either the result of a memorizing circuit or of a single coherent generalizing circuit behavior. If you remove a single generalizing circuit, your accuracy will degrade additively. 2. a correctly classified input is the result of a "combined" circuit consisting of multiple parallel generalizing "subprocesses" giving independent predictions, and if you remove any of these subprocesses, your accuracy will degrade multiplicatively. A lot of ML work only thinks about picture #1 (which is the natural picture to look at if you only have one generalizing circuit and every other circuit is a memorization). But the thing I'm saying is that picture #2 also occurs, and in some sense is "the info-theoretic default" (though both occur simultaneously -- this is also related to the ideas in this post)  

Keen on reading those write-ups, I appreciate the commitment!

Simultaneously, as they lead to separate paths, both of which are needed as inputs for the final node.

StefanHex*91

List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

What is going on with activation plateaus: Transformer activations space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem... (read more)
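A minimal sketch of the kind of experiment I have in mind (model, layer, and prompt are placeholders, not a worked-out setup):

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
    layer, pos = 6, -1
    hook_name = f"blocks.{layer}.hook_resid_post"

    base_logits, cache = model.run_with_cache(tokens)
    base_logprobs = base_logits[0, -1].log_softmax(-1)
    base_act = cache[hook_name][0, pos].clone()

    direction = torch.randn_like(base_act)
    direction = direction / direction.norm()

    # Patch in perturbed activations; a "plateau" shows up as near-zero KL
    # until the perturbation crosses some threshold scale.
    for scale in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
        def patch(act, hook, scale=scale):
            act[0, pos] = base_act + scale * direction
            return act

        logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, patch)])
        kl = torch.nn.functional.kl_div(
            logits[0, -1].log_softmax(-1), base_logprobs, log_target=True, reduction="sum"
        )
        print(f"scale={scale:4.1f}  KL(base || patched)={kl.item():.4f}")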

StefanHex*60

List of some medium-sized mech interp project ideas (see also: shorter and longer ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

Toy model of Computation in Superposition: The toy model of computation in superposition (CIS; Circuits-in-Sup, Comp-in-Sup post / paper) describes a way in which NNs could perform computation in superposition, rather than just storing information in superposition (TMS). It would be good to have some actually trai... (read more)

StefanHex*40

List of some short mech interp project ideas (see also: medium-sized and longer ideas). Feel encouraged to leave thoughts in the replies below!

Edit: My mentoring doc has more-detailed write-ups of some projects. Let me know if you're interested!

Directly testing the linear representation hypothesis by making up a couple of prompts which contain a few concepts to various degrees and test

  • Does the model indeed represent intensity as magnitude? Or are there separate features for separately intense versions of a concept? Finding the right prompts is tricky, e.g.
... (read more)

CLDR (Cross-layer distributed representation): I don't think Lee has written his up anywhere yet so I've removed this for now.

Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.

Thanks for the flag! It's these two images, I realize now that they don't seem to have direct links

Images taken from AMFTC and Crosscoders by Anthropic.

Thanks for the comment!

I think this is what most mech interp researchers more or less think. That said, I definitely expect many researchers would disagree with individual points, and the list doesn't fairly weigh all views and aspects (it's very biased towards "people I talk to"). (Also, this is in no way an Apollo / Apollo interp team statement, just my personal view.)

Thanks! You're right, I totally mixed up local and dense/distributed codes. I decided to just leave out that terminology.

StefanHex114

Why I'm not too worried about architecture-dependent mech interp methods:

I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture. While this is certainly a nice-to-have, and maybe a sign that a method is principled, I don't think this criterion itself is important.

I think that the biggest hurdle for interpretability is to understand any AI that produces advanced language (>=GPT2 level). We don't know how to write a non-ML program that speaks English, let alone reason, and we have no ide... (read more)

5Lucius Bushnaq
Agreed. I do value methods being architecture independent, but mostly just because of this:  At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.
3bilalchughtai
Agreed. A related thought is that we might only need to be able to interpret a single model at a particular capability level to unlock the safety benefits, as long as we can make a sufficient case that we should use that model. We don't care inherently about interpreting GPT-4, we care about there existing a GPT-4 level model that we can interpret.
4Jozdien
I think the usual reason this claim is made is that the person making the claim thinks it's very plausible LLMs aren't the paradigm that leads to AGI. If that's the case, then interpretability that's indexed heavily on them gets us understanding of something qualitatively weaker than we'd like. I agree that there'll be some transfer, but it seems better and not-very-hard to talk about how well different kinds of work transfer.

Why I'm not that hopeful about mech interp on TinyStories models:

Some of the TinyStories models are open source, and manage to output sensible language while being tiny (say 64dim embedding, 8 layers). Maybe it'd be great to try and thoroughly understand one of those?

I am worried that those models simply implement a bunch of bigrams and trigrams, and that all their performance can be explained by boring statistics & heuristics. Thus we would not learn much from fully understanding such a model. Evidence for this is that the 1-layer variant, which due t... (read more)

StefanHex*40-1

Collection of some mech interp knowledge about transformers:

Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.

Transformers take in a sequence of t... (read more)

3Rauno Arike
This is a nice overview, thanks! I don't think I've seen the CLDR acronym before, are the arguments publicly written up somewhere? Also, just wanted to flag that the links on 'this picture' and 'motivation image' don't currently work.
3aribrill
Thanks for the great writeup. Typo: I think you meant to write distributed, not local, codes. A local code is the opposite of superposition.
3[anonymous]
Who is "we"? Is it: 1. only you and your team? 2. the entire Apollo Research org? 3. the majority of mechinterp researchers worldwide? 4. some other group/category of people? Also, this definitely deserves to be made into a high-level post, if you end up finding the time/energy/interest in making one.
2Matt Goldenberg
this is great, thanks for sharing

Thanks for the nice writeup! I'm confused about why you can get away without interpretation of what the model components are:

In cases where we worry that our model learned a human-simulator / camera-simulator rather than actually predicting whether the diamond exists, wouldn't circuit discovery simply give us the human-simulator circuit? (And thus causal scrubbing doesn't save us.) I'm thinking in particular of cases where the human-simulator is easier to learn than the intended solution.

Of course if you had good interpretability, a way to realise whether ... (read more)

2Erik Jenner
You're totally right that this is an important difficulty I glossed over, thanks! TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming). ---------------------------------------- Just to restate things a bit, I'd distinguish two cases: * "In-distribution anomaly detection:" we are fine with flagging any input as "anomalous" that's OOD compared to the trusted distribution * "Off-distribution anomaly detection:" there are some inputs that are OOD but that we still want to classify as "normal" In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we're fine with flagging everything that's OOD. But  we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exlusion finetuning (appendix I in Redwood's measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training. I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they're just "use a measurement tampering detector to bootstrap to

Paper link: https://arxiv.org/abs/2407.20311

(I have neither watched the video nor read the paper yet, just in case someone else was looking for the non-video version)

[…] no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

I find myself really confused by this argument. Shards (or anything) do not need to be “concentrated in one spot” for studying them to make sense?

As Neel and Lucius say, you might study SAE latents or abstractions built on the weights; no one requires (or assumes) that things are concentrated in one spot.

Or to make another analogy, one can study neuroscience even though things are not concentrat... (read more)

2tailcalled
I don't doubt you can find many facts about SAE latents, I just don't think they will be relevant for anything that matters. I'm by-default bearish on neuroscience too, though it's more nuanced there. Feeding the output into the input isn't much thinking. It just allows the thinking to occur in a very diffuse way.

Even after reading this (2 weeks ago), today I couldn't manage to find the comment link and ended up manually scrolling down. I later noticed it (at the bottom left), but it's so far away from everything else. I think putting it somewhere at the top, near the rest of the UI, would make it much easier to find.

4habryka
Yeah, we'll probably make that adjustment soon. I also currently think the comment link is too hidden, even after trying to get used to it for a while.

I would like the following subscription: All posts with certain tags, e.g. all [AI] posts or all [Interpretability (ML & AI)] posts.

I just noticed (and enabled) a “subscribe” feature in the page for the tag, it says “Get notifications when posts are added to this tag.” — I’m unsure if those are emails, but assuming they are, my problem is solved. I never noticed this option before.

2Raemon
I think by default they are site-notifications, but in your user settings you can change them to emails.
StefanHex*Ω120

And here's the code to do it with replacing the LayerNorms with identities completely:

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_
... (read more)
3Quiche Eater
You should also set model.cfg.normalization_type = None afterwards. It's mostly a formality since you're doing it after initialization. ActivationCache.apply_ln_to_stack() is the only function I found which behaves incorrectly if you don't change this.
2Logan Riggs
And here's the code to convert it to NNsight (Thanks Caden for writing this awhile ago!) import torch from transformers import GPT2LMHeadModel from transformer_lens import HookedTransformer from nnsight.models.UnifiedTransformer import UnifiedTransformer model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu") # Undo my hacky LayerNorm removal for block in model.transformer.h: block.ln_1.weight.data = block.ln_1.weight.data / 1e6 block.ln_1.eps = 1e-5 block.ln_2.weight.data = block.ln_2.weight.data / 1e6 block.ln_2.eps = 1e-5 model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6 model.transformer.ln_f.eps = 1e-5 # Properly replace LayerNorms by Identities def removeLN(transformer_lens_model): for i in range(len(transformer_lens_model.blocks)): transformer_lens_model.blocks[i].ln1 = torch.nn.Identity() transformer_lens_model.blocks[i].ln2 = torch.nn.Identity() transformer_lens_model.ln_final = torch.nn.Identity() hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") removeLN(hooked_model) model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") removeLN(model_nnsight) device = torch.device("cuda" if torch.cuda.is_available() else "cpu") prompt = torch.tensor([1,2,3,4], device=device) logits = hooked_model(prompt) with torch.no_grad(), model_nnsight.trace(prompt) as runner: logits2 = model_nnsight.unembed.output.save() logits, cache = hooked_model.run_with_cache(prompt) torch.allclose(logits, logits2)
StefanHexΩ260

Here's a quick snippet to load the model into TransformerLens!

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=False, center_unembed=False).to("cpu")
# Kill the LayerNorms because TransformerLens overwrites eps
for block in hooked_model.blocks:
    block.ln1.eps = 1e12
    block.ln2.eps = 1e12
hooked_model.ln_final.eps = 1e12

# Make sure the outp
... (read more)
2StefanHex
And here's the code to do it with replacing the LayerNorms with identities completely: import torch from transformers import GPT2LMHeadModel from transformer_lens import HookedTransformer model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu") # Undo my hacky LayerNorm removal for block in model.transformer.h: block.ln_1.weight.data = block.ln_1.weight.data / 1e6 block.ln_1.eps = 1e-5 block.ln_2.weight.data = block.ln_2.weight.data / 1e6 block.ln_2.eps = 1e-5 model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6 model.transformer.ln_f.eps = 1e-5 # Properly replace LayerNorms by Identities class HookedTransformerNoLN(HookedTransformer): def removeLN(self): for i in range(len(self.blocks)): self.blocks[i].ln1 = torch.nn.Identity() self.blocks[i].ln2 = torch.nn.Identity() self.ln_final = torch.nn.Identity() hooked_model = HookedTransformerNoLN.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu") hooked_model.removeLN() hooked_model.cfg.normalization_type = None prompt = torch.tensor([1,2,3,4], device="cpu") logits = hooked_model(prompt) print(logits.shape) print(logits[0, 0, :10])

I really like the investigation into properties of SAE features, especially the angle of testing whether SAE features have particular properties than other (random) directions don't have!

Random directions as a baseline: Based on my experience here I expect random directions to be a weak baseline. For example the covariance matrix of model activations (or SAE features) is very non-uniform. I'd second @Hoagy's suggestion of linear combination of SAE features, or direction towards other model activations as I used here.
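For reference, a quick sketch of what I mean by a covariance-matched random baseline (the tensors here are stand-ins, not real activations):

    import torch

    acts = torch.randn(10_000, 64) @ torch.randn(64, 64)  # stand-in for [n_samples, d_model] activations
    cov = torch.cov(acts.T)                                # [d_model, d_model]
    cov = cov + 1e-6 * torch.eye(cov.shape[0])             # small jitter for numerical stability

    dist = torch.distributions.MultivariateNormal(torch.zeros(cov.shape[0]), covariance_matrix=cov)
    random_dirs = dist.sample((100,))                      # [100, d_model]
    random_dirs = random_dirs / random_dirs.norm(dim=-1, keepdim=True)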

Ablation vs functional FT-LLC: I found t... (read more)

I like this idea! I'd love to see checks of this on the SOTA models which tend to have lots of layers (thanks @Joseph Miller for running the GPT2 experiment already!).

I notice this line of argument would also imply that the embedding information can only be accessed up to a certain layer, after which it will be washed out by the high-norm outputs of layers. (And the same for early MLP layers which are rumoured to act as extended embeddings in some models.) -- this seems unexpected.

Additionally, they would be further evidence (but not conclusive[2]) towards

... (read more)

I know he’s legitimately affiliated with that YT channel

Can I ask how you know that? The amount of "w Stephen Fry" video titles made me suspicious, and I wondered whether it's AI generated and not Stephen-Fry-endorsed, but I haven't done any further research.

Edit: A colleague just pointed out that other videos are up to 7 years old (and AI voice wasn't this good then), so in that video the voice must be real

2TeaTieAndHat
Apparently, he co-founded the channel. But of course he might have had his voiced faked just for this video, as some suggested in the comments to it.

Has anyone tested whether feature splitting can be explained by composite (non-atomic) features?

  • Feature splitting is the observation that SAEs with larger dictionary size find features that are geometrically (cosine similarity) and semantically (activating dataset examples) similar. In particular, a larger SAE might find multiple features that are all similar to each other, and to a single feature found in a smaller SAE.
    • Anthropic gives the example of the feature " 'the' in mathematical prose" which splits into features " 'the' in mathematics
... (read more)
1RGRGRG
I like this recent post about atomic meta-SAE features, I think these are much closer (compared against normal SAEs) to what I expect atomic units to look like: https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much.

In unrelated experiments I found that steering into a (uniform) random direction is much less effective, than steering into a random direction sampled with same covariance as the real activations. This suggests that there might be a lot of directions[1] that don't influence the output of the network much. This was on GPT2 but I'd expect it to generalize for other Transformers.

  1. ^

    Though I don't know

... (read more)
StefanHex112

Hmm, with that we'd need  to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 -> 20) as an indication of how much the cosine similarity changes, then this is consistent with  for the original vector. This seems plausible for a steering vector?

  1. ^

    Thanks to @Lucius Bushnaq for correcting my earlier wrong number

That model has an Attention and MLP block (GPT2-style model with 1 layer but a bit wider, 21M params).

I changed my mind over the course of this morning. The TinyStories models' language isn't that bad, and I think it'd be a decent research project to try to fully understand one of these.

I've been playing around with the models this morning, quotes from the 1-layer model:

Once upon a time, there was a lovely girl called Chloe. She loved to go for a walk every morning and one day she came across a road.

One day, she decided she wanted to go for a ride. She jump

... (read more)
2RogerDearnaley
Yup: the 1L model samples are full of non-sequiturs, to the level I can't imagine a human child telling a story that badly; whereas the first 2L model example has maybe one non-sequitur/plot jump (the way the story ignores the content of bird's first line of dialog), which the rest of the story then works into, so it ends up almost making sense, in retrospect (except it would have made better sense if the bear had said that line). The second example has a few non-sequiturs, but they're again not glaring and continuous the way the 1L output is. (As a parent) I can imagine a rather small human child telling a story with about the 2L level of plot inconsistencies.

The TinyStories setup seems quite simple, in the sense that I can see how you could achieve TinyStories levels of loss by following simple rules plus a bunch of memorization.

Empirically, one of the best models in the TinyStories paper is a super-wide 1L transformer, which is basically bigrams, trigrams, and slightly more complicated variants [see Buck's post], but nothing that requires a step of reasoning.

I am actually quite uncertain where the significant gap between TinyStories, GPT-2 and GPT-4 is. Maybe I could fully understand TinyStories-1L if I tried, would this tell us about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.

2RogerDearnaley
From rereading the Tiny Stories paper, the 1L model did a really bad job of maintaining the internal consistency of the story and figuring out and allowing for the logical consequences of events, but otherwise did a passably good job of speaking coherent childish English. So the choice on transformer block count would depend on how interested you are in learning how to speak English that is coherent as well as grammatical. Personally I'd probably want to look at something in the 3–4-layer range, so it has an input layer, and output layer, and at least one middle layer, and might actually contain some small circuits. I would LOVE to have an automated way of converting a Tiny Stories-size transformer to some form of declarative language spaghetti code. It would probably help to start with a heavily-quantized version. For example, a model trained using the techniques of the recent paper on building AI using trinary logic (so roughly a 1.6-bit quantization, and eliminating matrix multiplication entirely) might be a good place to start, combined with the sort of techniques the model-pruning folks have been working on for which model-internal interactions are important on the training set and which are just noise and can be discarded. I strongly suspect that every transformer model is just a vast pile of heuristics. In certain cases, if trained on a situation that genuinely is simple and has a specific algorithm to solve it runnable during a model forward-pass (like modular arithmetic, for example), and with enough data to grok it, then the resulting heuristic may actually be an elegant True Name algorithm for the problem. Otherwise, it's just going to be a pile of heuristics that SGD found and tuned. Fortunately SGD (for reasons that singular learning theory illuminates) has a simplicity bias that gives a prior that acts like Occam's Razor or a Kolmogorov Complexity prior, so tends to prefer algorithms that generalize well (especially as the amount of data tends to inf
3jow
Is that TinyStories model a super-wide attention-only transformer (the topic of the mechanistic interp work and Buck's post you cite)? I tried to figure it out briefly and couldn't tell, but I bet it isn't, and instead has extra stuff like an MLP block. Regardless, in my view it would be a big advance to really understand how the TinyStories models work. Maybe they are "a bunch of heuristics" but maybe that's all GPT-4, and our own minds, are as well…

Thanks for the comment Lawrence, I appreciate it!

  • I agree this doesn't distinguish superposition vs no superposition at all; I was more thinking about the "error correction" aspect of MCIS (and just assuming superposition to be true). But I'm excited too for the SAE application, we got some experiments in the pipeline!
  • Your Correct behaviour point sounds reasonable but I feel like it's not an explanation? I would have the same intuitive expectation, but that doesn't explain how the model manages to not be sensitive. Explanations I can think of in increasing
... (read more)

My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.

  • This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE.
  • What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and tha
... (read more)

There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour.

But often we make the claim that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.

Are the features learned by the model the same as the features learned by SAEs?

TL;DR: I want model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” have it. I want to know whether SAE-features have this property too, or if SAE-features do not match the model-features.

Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in... (read more)

1StefanHex
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset. * This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE. * What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won't help us understand models 1. ^ Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
1StefanHex
There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour. But often we make the claim that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.