All of wesg's Comments + Replies

Activation space interpretability may be doomed

wesg3mo295

This seems like an easy experiment to do!

Here is Sonnet 3.6's 1-shot output (colab) and plot below. I asked for PCA for simplicity.

Looking at the PCs vs x, PC2 is kinda close to giving you x^2, but indeed this is not an especially helpful interpretation of the network.

Good post!

3Louis Jaburi3mo

I played around with the x^2 example as well and got similar results. I was wondering why there are two more dominant PCs: if you assume there is no bias, then the activations will all look like λ∗ReLU(E) or λ∗ReLU(−E), and I checked that the two directions found by the PCA approximately span the same space as <ReLU(E),ReLU(−E)>. I suspect something similar is happening with bias. In this specific example there is a way to get the true direction w_out from the activations: do a PCA on the gradient of the output with respect to the activations. In this case, it is easily explained by computing the gradients by hand: each one is a multiple of w_out.
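The span argument in the reply above can be checked with a small sketch. It assumes a bias-free first layer that maps a scalar x to ReLU(x·E); the embedding `E`, the hidden width, and the input grid are illustrative choices, not the setup from the colab.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden = 16
E = rng.normal(size=d_hidden)            # hypothetical embedding of the scalar input
xs = np.linspace(-2.0, 2.0, 201)         # inputs on both sides of zero

# Bias-free hidden activations: x * ReLU(E) for x > 0, |x| * ReLU(-E) for x < 0.
acts = np.maximum(xs[:, None] * E[None, :], 0.0)

# PCA via SVD of the mean-centered activations.
centered = acts - acts.mean(axis=0)
_, sing_vals, components = np.linalg.svd(centered, full_matrices=False)
explained = sing_vals**2 / np.sum(sing_vals**2)
top2 = components[:2]                    # rows are orthonormal PC directions


def residual_outside(v, basis):
    """Relative norm of the part of v that lies outside span(basis rows)."""
    proj = basis.T @ (basis @ v)
    return np.linalg.norm(v - proj) / np.linalg.norm(v)


res_pos = residual_outside(np.maximum(E, 0.0), top2)
res_neg = residual_outside(np.maximum(-E, 0.0), top2)
print(explained[:2].sum(), res_pos, res_neg)
```

Under these assumptions the first two PCs carry essentially all of the variance, and both ReLU(E) and ReLU(−E) lie in their span, which is why neither PC alone has to line up with x^2.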

SAE reconstruction errors are (empirically) pathological

wesg1y30

This was also my hypothesis when I first looked at the table. However, I think this is mostly an illusion. The sample means for rare tokens have very high standard errors, so rare tokens dominate both the unusually high and the unusually negative average KL gaps. And indeed, the correlation between token frequency and KL gap is approximately 0.
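A quick simulation illustrates the standard-error point. It is a sketch on synthetic data only: every token's per-occurrence "KL gap" is drawn from the same zero-mean distribution, and the Zipf-like `counts` are an assumption made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 2000
ranks = np.arange(1, n_tokens + 1)
counts = np.maximum(1, 10_000 // ranks)      # assumed Zipf-like token frequencies

# Same underlying distribution for every token, so any pattern in the
# per-token means is pure sampling noise.
mean_gap = np.array([rng.normal(0.0, 1.0, size=c).mean() for c in counts])

# The 50 tokens with the most extreme average gap, in either direction.
extreme = np.argsort(np.abs(mean_gap))[-50:]
median_count_extreme = np.median(counts[extreme])
median_count_all = np.median(counts)

# Frequency is still uncorrelated with the signed gap.
corr = np.corrcoef(counts, mean_gap)[0, 1]
print(median_count_extreme, median_count_all, corr)
```

The extreme rows of such a table are filled by rare tokens even though frequency has no real effect, which matches the near-zero correlation reported above.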

SAE reconstruction errors are (empirically) pathological

wesg1y10

Yes, this is a good consideration. I think

KL as a metric makes a good tradeoff here by mostly ignoring changes to tokens the original model treated as low probability (as opposed to measuring something more cursed like log prob L2 distance) and so I think captures the more interesting differences.
This motivates having good baselines to determine what this noise floor should be.
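To make the KL versus log-prob L2 comparison concrete, here is a toy sketch with a four-token vocabulary. The logits and the size of the perturbation are arbitrary illustrative values.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def kl(logp, logq):
    """KL(p || q) computed from log-probabilities."""
    return float(np.sum(np.exp(logp) * (logp - logq)))


def logprob_l2(logp, logq):
    """L2 distance between the two log-probability vectors."""
    return float(np.linalg.norm(logp - logq))


base = np.array([5.0, 4.0, 0.0, -10.0])     # last token is very unlikely

rare_shift = base.copy()
rare_shift[3] -= 5.0                        # perturb the low-probability token
top_shift = base.copy()
top_shift[0] -= 5.0                         # perturb the most likely token

logp = log_softmax(base)
kl_rare = kl(logp, log_softmax(rare_shift))
kl_top = kl(logp, log_softmax(top_shift))
l2_rare = logprob_l2(logp, log_softmax(rare_shift))
l2_top = logprob_l2(logp, log_softmax(top_shift))
print(kl_rare, kl_top, l2_rare, l2_top)
```

The two perturbations are of similar size under the log-prob L2 distance, but KL is large only when the change hits a token the original model considered likely.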

SAE reconstruction errors are (empirically) pathological

wesg1y90

This is a great comment! The basic argument makes sense to me, though based on how much variability there is in this plot, I think the story is more complicated. Specifically, I think your theory predicts that the SAE reconstructed KL should always be out on the tail, and these random perturbations should have low variance in their effect on KL.

I will do some follow up experiments to test different versions of this story.

SAE reconstruction errors are (empirically) pathological

wesg1y10

Right, I suppose there could be two reasons scale finetuning works:

(1) The L1 penalty reduces the norm of the reconstruction, but does so proportionally across all active features, so a ~uniform boost in scale can mostly fix the reconstruction.
(2) Due to activation magnitude or frequency or something else, features are inconsistently suppressed and therefore need to be scaled in the correct proportion.

The SAE-norm patch baseline tests (1), but based on your results, the scale factors vary within 1-2x, so it seems more likely that your improvements come from (2).
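The difference between the two explanations can be sketched on synthetic data. Everything here is assumed for illustration: a random `decoder`, made-up feature activations, and shrinkage factors chosen so the needed per-feature corrections fall in the 1-2x range. It is not the SAE or the baseline from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_model = 8, 32
decoder = rng.normal(size=(n_feat, d_model))   # hypothetical decoder directions
feats = rng.uniform(0.5, 2.0, size=n_feat)     # hypothetical true feature activations
x = feats @ decoder                            # the activation being reconstructed


def best_scalar_fix(x, x_hat):
    """Least-squares global rescale of x_hat, and the relative error left over."""
    alpha = (x @ x_hat) / (x_hat @ x_hat)
    rel_err = np.linalg.norm(x - alpha * x_hat) / np.linalg.norm(x)
    return alpha, rel_err


# Explanation 1: every feature is shrunk by the same factor.
uniform_hat = (0.8 * feats) @ decoder
alpha_uniform, err_uniform = best_scalar_fix(x, uniform_hat)

# Explanation 2: each feature is shrunk by a different factor in [0.5, 1].
shrink = rng.uniform(0.5, 1.0, size=n_feat)
uneven_hat = (shrink * feats) @ decoder
alpha_uneven, err_uneven = best_scalar_fix(x, uneven_hat)

print(alpha_uniform, err_uniform, err_uneven)
```

A single global scale recovers the activation exactly in the uniform case but leaves a clear residual in the uneven case, which is the situation where per-feature scale factors would be expected to help.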

I don’t se... (read more)

SAE reconstruction errors are (empirically) pathological

wesg1y10

Yup! I think something like this is probably going on. I blamed this on L1 but this could also be some other learning or architectural failure (eg, not enough capacity):

Some features are dense (or groupwise dense, i.e., they frequently co-occur). Due to the L1 penalty, some of these dense features are not represented. However, for KL it ends up being better to noisily represent all the features than to accurately represent some fraction of them.

Language Models Don't Learn the Physical Manifestation of Language

wesg1y60

Huh I am surprised models fail this badly. That said, I think

We argue that there are certain properties of language that our current large language models (LLMs) don't learn.

is too strong a claim based on your experiments. For instance, these models definitely have representations for uppercase letters:

In my own experiments I have found it hard to get models to answer multiple choice questions. It seems like there may be a disconnect in prompting a model to elicit information which it has in fact learned.

Here is the code to reproduce the ... (read more)

1Jaehyuk Lim1y

What's the difference between "having a representation" for uppercase/lowercase and using the representation to solve an MCQ or A/B test? From your investigations, do you have intuitions as to what might be the mechanism of disconnect? I'm interested in seeing what might cause these models to perform poorly, despite having representations that seem to be relevant to solving the task, at least to us people. Considering that the tokenizer for Mistral-7B probably includes a case-sensitive dictionary (https://discuss.huggingface.co/t/case-sensitivity-in-mistralai-mistral-7b-v0-1/70031), the presence of distinct representations for uppercase and lowercase characters might not be as relevant to the task for the model as one would assume. It seems plausible that these representations do not substantially influence the model's ability to perform H-Test, such as answering multiple-choice questions. Perhaps one should probe for another representation, such as a circuit for "eliciting information".

1Bruce W. Lee1y

I also want to point you to this (https://arxiv.org/abs/2402.11349, Appendix I, Figure 7, Last Page, "Blueberry?: From Reddit u/AwkwardIllustrator47, r/mkbhd: Was listening to the podcast. Can anyone explain why Chat GPT doesn’t know if R is in the word Blueberry?"). Large model failures on these task types were rather a widely observed phenomenon but with no empirical investigation.

1Bruce W. Lee1y

I appreciate this analysis. I'll take more time to look into this and then get back to write a better reply.

Gemini 1.5 released

wesg1y256

And SORA too: https://openai.com/sora

7Cole Wyeth1y

This is mind-blowing. When it works it's crazy; even the "glitches" are bizarre, as if the real world were a slightly broken video game.

Some additional SAE thoughts

wesg1y41

This is further evidence that there's no single layer at which individual outputs are learned, instead they're smoothly spread across the full set of available layers.
I don't think this simple experiment is by any means decisive, but to me it makes it more likely that features in real models are in large part refined iteratively layer-by-layer, with (more speculatively) the intermediate parts not having any particularly natural representation.

I've also updated more and more in this direction.

I think my favorite explanation/evidence of this in general comes... (read more)

1Hoagy1y

Huh, I'd never seen that figure, super interesting! I agree it's a big issue for SAEs and one that I expect to be thinking about a lot. I didn't have any strong candidate solutions as of writing the post, and wouldn't even be able to say any thoughts I have on the topic now, sorry. Wish I'd posted this a couple of weeks ago.

AGI safety career advice

wesg2y104

For mechanistic interpretability research, we just released a new paper on neuron interpretability in LLMs, with a large discussion on superposition! See
Paper: https://arxiv.org/abs/2305.01610
Summary: https://twitter.com/wesg52/status/1653750337373880322

Clarifying mesa-optimization

wesg2y110

There has been some work on understanding in-context learning which suggests that models are doing literal gradient descent:

Superposition allows the model to do a lot of things at once. Thus, if the model wants to use its space efficiently, it performs multiple steps at once or uses highly compressed heuristics even if they don’t co

... (read more)
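As a toy illustration of what "doing gradient descent in context" can mean, the sketch below checks a standard identity: for in-context linear regression, one gradient step from zero weights gives the same prediction as an unnormalized linear-attention readout over the context. The data, the learning rate `lr`, and the choice of linear rather than softmax attention are all assumptions for the example; it says nothing about what a trained transformer actually implements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, d = 10, 4
X = rng.normal(size=(n_ctx, d))      # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                       # in-context targets y_i
x_query = rng.normal(size=d)
lr = 0.1                             # assumed learning rate

# One gradient-descent step from w = 0 on L(w) = 0.5 * sum_i (w . x_i - y_i)^2.
grad_at_zero = -(X.T @ y)
w_one_step = -lr * grad_at_zero
pred_gd = w_one_step @ x_query

# Linear attention: keys x_i, values y_i, query x_query, no softmax.
scores = X @ x_query
pred_attn = lr * (scores @ y)

print(pred_gd, pred_attn)
```

Both predictions equal lr · Σ_i y_i (x_i · x_query), so a single linear-attention layer is expressive enough to carry out this step; whether trained models do so is the empirical question.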

2Marius Hobbhahn2y

How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding. Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument. I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit.

Let's Terraform West Texas

wesg3y41

Many parts of west Texas are also suitable for wind power which could potentially be interspersed within a large solar array. Increasing the power density of the land might make it cost effective to develop high energy industries in the area or justify the cost of additional infrastructure.

Request for Alignment Research Project Recommendations

Answer by wesgSep 03, 202260

One website dedicated to this: https://aisafetyideas.com/

1Rauno Arike3y

Thanks, that definitely seems like a great way to gather these ideas together!

Taking the parameters which seem to matter and rotating them until they don't

wesg3y20

You could hope for more even for a random non-convex optimization problem if you can set up a tight relaxation. E.g. this paper gives you optimality bounds via a semidefinite relaxation, though I am not sure if it would scale to the size of problems relevant here.

Taking the parameters which seem to matter and rotating them until they don't

wesg3y91

Would love to see more in this line of work.

We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations.

Could you explain how you are formulating/solving this optimization problem in more detail?

Garrett Baker3y*241

Suppose our model has the following format:

$\text{Model}(\text{input}) = (M_3 \circ \mathrm{NL} \circ M_2 \circ \mathrm{NL} \circ M_1)(\text{input})$

where $M_3, M_2, M_1$ are matrix multiplies, and $\mathrm{NL}$ is our nonlinear layer.

We also define a sparsity measure to minimize, chosen for the fun property that it really really really likes zeros compared to almost all other numbers.

$\text{Sparsity}(A) = -\sum_{i,j} \frac{1}{|a_{i,j}| + 0.1}$

Note that lower sparsity according to this measure means more zeros.
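The sparsity measure is simple to write down directly. The sketch below implements only the formula above; the 2x2 matrices are illustrative, and the rotation search described in the rest of the comment is not reproduced.

```python
import numpy as np


def sparsity(a, eps=0.1):
    """Sparsity(A) = -sum_{i,j} 1 / (|a_ij| + eps); lower means more zeros."""
    return float(-np.sum(1.0 / (np.abs(a) + eps)))


identity = np.eye(2)                          # two exact zeros
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = rotation @ identity                 # same matrix in a basis with no zeros

all_zero = sparsity(np.zeros((2, 2)))
print(all_zero, sparsity(identity), sparsity(rotated))
```

Each exact zero contributes −1/0.1 = −10, far more than any entry of moderate size, so the measure strongly prefers the basis in which entries vanish.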

There are two reasonable ways of finding the right rotations. I will describe one way in depth, and the other way not-so in depth. Do note t... (read more)

A Mechanistic Interpretability Analysis of Grokking

wesg3y161

Could you describe your inner thought loop when conducting these sorts of mechanistic analyses? I.e., What Are You Tracking In Your Head?

Deep neural networks are not opaque.

wesg3y42

Indeed, it does seem possible to figure out where simple factual information is stored in the weights of an LLM, and to distinguish between the model "knowing" a fact and simply parroting it.

Deep Dives: My Advice for Pursuing Work in Research

wesg3y30

In addition to Google Scholar, Connected Papers is a useful tool to quickly sort through related work and get a visual representation of a subarea.