If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever sense that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculuses all the way down to the individual transistor or ReLU neuron.
I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:
No, that's not the point. That's actually the opposite of what I'm trying to say. The subshards implement the algorithmic pieces and the broader agent has an "intrinsic desire" for power. The subshards themselves are not agentic, and that's why (in context) I substitute them in for "circuits".
It's explained in this post that I linked to. Though I guess in context I do say "prioritize" in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of "implementing the power-seeking behavior demanded by some broader consequentialist plan"; they're just seeking power, just 'cuz.
From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”
Charles Foster comments:
- "We are attempting to mechanistically explain how an agent makes decisions. One proposed reduction is that inside the agent, there is an even smaller inner agent that interacts with a non-agential evaluative submodule to make decisions for the outer agent. But that raises the immediate questions of “How does the inner agent make its decisions about how to interact with the evaluative submodule?” and then “At some point, there’s gotta be some non-agential causal structure that is responsible for actually implementing decision-making, right?” and then “Can we just explain the original agent’s behavior in those terms? What is positing an externalized evaluative submodule buying us?"
Perhaps my emphasis on mechanistic reasoning, and my unusual level of precision in speculating about AI internals, make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.
But perhaps people don’t realize that a network which is inner-aligned on an objective will also require a precise and conforming internal structure, and they don’t realize this because no one has written detailed plausible stabs at inner-aligned cognition.
I would recommend reading the original reddit post that motivated it: https://www.reddit.com/r/biology/comments/16y81ct/the_case_for_whales_actually_matching_or_even/.
It is meant seriously, but the author is rightly acknowledging how far-fetched it sounds.
[00:31:25] Timothy:... This is going to be like, they didn't talk about any content, like there's no specific evidence,
[00:31:48] Elizabeth: I wrote down my evidence ahead of time.
[00:31:49] Timothy: Yeah, you already wrote down your evidence
I feel pretty uncertain about the extent to which I agree with your views on EA. But this podcast didn't really help me decide, because there wasn't much discussion of specific evidence. Where is it all written down? I'm aware of your post on vegan advocacy, but it's unclear to me whether there are many more examples. I also heard a similar line of despair about EA epistemics from other long-time rationalists when hanging around Lighthaven this summer. But basically no one brought up specific examples.
It seems difficult to characterize the EA movement as a monolith in the way you're trying to do. The case of vegan advocacy is mostly irrelevant to my experience of EA. I have little contact with vegan advocates and most of the people I hang around in EA circles seem to have quite good epistemics.
However, I can relate to your other example, because I'm one of the "baby EAs" who was vegetarian and was in the Lightcone offices in summer 2022. But my experience provides something of a counterexample. In fact, I became vegetarian before encountering EA and mostly found out about the potential nutritional problems from other EAs. When you wrote your post, I got myself tested for iron deficiency and started taking supplements (although not for iron deficiency). I eventually stopped being vegetarian, instead offsetting my impact with donations to animal charities, even though this isn't very popular in EA circles.
My model is that people exist on a spectrum from weirdness to normie-ness. The weird people are often willing to pay social costs to be more truthful, while the more normie people will refrain from saying and thinking the difficult truths. But most people are mostly fixed at a certain point on the spectrum. The truth-seeking weirdos probably made up a larger proportion of the early EA movement, but I'd guess in absolute terms the number of those sorts of people hanging around EA spaces has not declined, and their epistemics have not degraded - there just aren't very many of them in the world. But these days there is a greater number of the more normie people in EA circles too.
And yes, it dilutes the density of good epistemics in EA. But that doesn't seem like a reason to abandon the movement. It is a sign that more people are being influenced by good ideas, and that creates opportunities for the movement to do bigger things.
When you want to have interesting discussions with epistemic peers, you can still find your own circles within the movement to spend time with, and you can still come to the (relative) haven of LessWrong. If LessWrong culture also faced a similar decline in epistemic standards I would be much more concerned, but it has always felt like EA is the applied, consumer-facing product of the rationalist movement, one that targets real-world impact over absolute truth-seeking. For example, I think most EAs (and also some rationalists) are hopelessly confused about moral philosophy, but I'm still happy there are more people trying to live by utilitarian principles, who might otherwise not be trying to maximize value at all.
Respect for doing this.
I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."
I would also echo the feedback here. There's no reason to write in the same style as cranks.
Question → CoT → Answer
So to be clear: testing whether this causal relationship holds is actually important; it's just that we need to do it on questions where the CoT is required for the model to answer the question?
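For concreteness, here's a rough sketch of the kind of test I mean (the `ask` callable and the truncation intervention are just illustrative placeholders, not a claim about how any particular experiment was run):

```python
# Sketch only: a toy check of whether the CoT is causally load-bearing for the answer.
# `ask` is a hypothetical callable (prompt -> answer string); truncating the CoT is
# just one possible way to intervene on it.

def cot_is_causally_needed(ask, question, correct_answer, cot):
    # 1. Restrict to questions the model fails without any CoT.
    no_cot = ask(f"{question}\nAnswer immediately, with no reasoning:")
    if no_cot == correct_answer:
        return None  # CoT isn't required here, so this question can't test the claim

    # 2. With its own CoT, the model should get the answer right.
    with_cot = ask(f"{question}\n{cot}\nAnswer:")

    # 3. Intervene on the CoT (here, cut it in half) and see if the answer degrades.
    words = cot.split()
    truncated = " ".join(words[: len(words) // 2])
    with_truncated = ask(f"{question}\n{truncated}\nAnswer:")

    return with_cot == correct_answer and with_truncated != correct_answer
```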
Optimize the steering vector to minimize some loss function.
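A minimal sketch of what that could look like (everything here is a toy stand-in: a random frozen unembedding and a fixed activation instead of a real model, and an arbitrary target-token loss):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 64, 100

# Toy stand-ins for a frozen model: an unembedding matrix and a fixed
# residual-stream activation at the site where the steering vector is added.
W_U = torch.randn(d_model, vocab)
resid = torch.randn(d_model)
target = torch.tensor([7])  # hypothetical target token id

steer = torch.zeros(d_model, requires_grad=True)
opt = torch.optim.Adam([steer], lr=1e-1)

for step in range(200):
    logits = (resid + steer) @ W_U                 # add the vector, then unembed
    loss = F.cross_entropy(logits.unsqueeze(0), target)
    loss = loss + 1e-2 * steer.norm()              # keep the vector small
    opt.zero_grad()
    loss.backward()
    opt.step()
```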
Crossposted from https://x.com/JosephMiller_/status/1839085556245950552
1/ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token.🧵
2/ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.
3/ The token "deaf" decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.
4/ Usually SAEs are trained on the internal activations of a component for billions of different input strings. But here we just train on the rows of the embedding weight matrix (where each row is the embedding for one token).
5/ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.
6/ The reconstructions are not highly accurate – on average we have ~60% variance unexplained (~0.7 cosine similarity) with ~6 features active per token. So more work is needed to see how useful they are.
7/ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier - but the results are qualitatively similar when you train on all embeddings.
8/ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!
9/ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He.
Embedding SAEs were also invented independently by @Michael Pearce.
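For anyone who wants to play with the idea, here's a minimal sketch of the kind of setup the thread describes; the embedding matrix here is random and the sizes/hyperparameters are placeholders, so it only illustrates the shape of the method, not the actual experiment:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, n_feats = 5000, 256, 2000   # sizes in the same spirit as the thread

W_E = torch.randn(vocab, d_model)           # stand-in for the model's token embedding matrix

# SAE with untied encoder/decoder, trained directly on rows of the embedding matrix.
enc = torch.nn.Linear(d_model, n_feats)
dec = torch.nn.Linear(n_feats, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    batch = W_E[torch.randint(0, vocab, (256,))]      # each datapoint is one token's embedding
    acts = F.relu(enc(batch))
    recon = dec(acts)
    mse = F.mse_loss(recon, batch)
    l1 = acts.abs().sum(dim=-1).mean()                # sparsity penalty
    loss = mse + 1e-3 * l1
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Visualize" a feature: project its decoder direction onto the embeddings
# and list the most similar tokens (as in step 2/ of the thread).
feature_id = 0
direction = dec.weight[:, feature_id]                 # (d_model,) output direction
sims = F.cosine_similarity(W_E, direction.unsqueeze(0), dim=-1)
top_tokens = sims.topk(10).indices                    # indices into the (toy) vocab

# Rough quality check in the spirit of step 6/: fraction of variance unexplained.
with torch.no_grad():
    recon_all = dec(F.relu(enc(W_E)))
    fvu = ((W_E - recon_all) ** 2).sum() / ((W_E - W_E.mean(0)) ** 2).sum()
```

The PCA comparison in step 8/ can be done the same way: rank tokens by similarity to the top principal components of the embedding matrix instead of to SAE decoder directions.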
Nice post. I think this is a really interesting discovery.
[Copying from messages with Joseph Bloom]
TLDR: I'm confused about what is different about the SAE input that causes the absorbed feature not to fire.
Me:
Summary of your findings
- Say you have a “starts with s” feature and a “snake” feature.
- You find that for most words, “starts with s” correctly categorizes words that start with s. But for a few words that start with s, like snake, it doesn’t fire.
- These exceptional tokens where it doesn’t fire all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
- You say that the “snake” feature has absorbed the “starts with s” feature because the concept of snake also contains/entails the concept of ‘starts with s’.
- Most of the features that absorb other features correspond to common words, like “and”.
So why is this happening? Well it makes sense that the model can do better on L1 on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense it would only have enough space to have these specific token features for common tokens.
Joseph Bloom:
rather than the “starts with s” feature and, say, the “reptile” feature
We found cases of seemingly more general features getting absorbed in the context of spelling, but they are more rare / probably the exception. It's worth noting that we suspect feature absorption is just easiest to find for token-aligned features, but conceptually it could occur any time a similar structure exists between features.
And it makes sense it would only have enough space to have these specific token features for common tokens.
I think this needs further investigation. We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token). I predict there is a strong density effect but it could be non-trivial.
Me:
We found cases of seemingly more general features getting absorbed in the context of spelling
What’s an example?
We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token)
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Does this only happen if the French word also starts with s?
Joseph Bloom:
What’s an example?
- Latent aligned for a few words at once. Eg: "assistance" but fires weakly on "help". We saw it absorb both "a" and "h"!
- Latent from multi-token words that also fires on words that share a common prefix (https://feature-absorption.streamlit.app/?layer=16&sae_width=65000&sae_l0=128&letter=c see latent 26348).
- Latent that fires on a token + weakly on subsequent tokens https://feature-absorption.streamlit.app/?layer=16&sae_width=65000&sae_l0=128&letter=c
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Yes
Does this only happen if the French word also starts with s?
More likely. I think the process is stochastic so it's all distributions.
Me:
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? How is it able to fire on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature? I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
Joseph Bloom:
I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn't struggle with obvious examples. I think the feature absorption work is not about how models really work, it's about how SAEs obscure how models work.
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature?
Short answer, I don't know. Long answer - some hypotheses:
- Linear probes can easily do calculations of the form "A AND B". In large vector spaces, it may be possible to learn a direction of the form "(^S.*) AND not (snake) AND not (sun) ...". Note that "snake" has a component separate from "starts with s", so this is possible. To the extent this may be hard, that's possibly why we don't see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
- Encoder weights and decoder weights aren't tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don't tie the weights, the model can detect "(^S.*) AND not (snake) AND not (sun) ..." but write "(^S.*)". I'm interested to explore this further and am sad we didn't get to this in the project.
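To make that second hypothesis concrete, here's a toy numerical sketch (all directions, weights, and biases are made up for illustration): with untied weights the encoder can read "starts with s AND NOT snake-ish" while the decoder still writes the clean "starts with s" direction, so the latent simply never fires on the absorbed token.

```python
import torch
import torch.nn.functional as F

d = 32
# Toy orthogonal unit directions: a general "starts with s" direction and a
# snake-specific direction (the component of "snake" separate from "starts with s").
starts_with_s = torch.zeros(d); starts_with_s[0] = 1.0
snake_specific = torch.zeros(d); snake_specific[1] = 1.0

# Toy token embeddings: both contain the "starts with s" direction.
sun_emb = starts_with_s.clone()
snake_emb = starts_with_s + snake_specific

# Untied SAE latent for "starts with s": the encoder also reads *negatively*
# off the snake-specific direction, so the latent stays off on "snake" even
# though the s-direction is present in its embedding.
w_enc = starts_with_s - 2.0 * snake_specific
w_dec = starts_with_s
bias = -0.5

act_sun = F.relu(w_enc @ sun_emb + bias)      # 0.5 -> fires on an ordinary s-word
act_snake = F.relu(w_enc @ snake_emb + bias)  # 0.0 -> suppressed on the absorbed token

# The decoder still writes the clean "starts with s" direction whenever the latent fires.
recon_sun = act_sun * w_dec
print(act_sun.item(), act_snake.item())
```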
This is fantastic technical writing. It would have taken me hours to understand these papers this deeply, but you convey the core insights quickly in an entertaining and understandable way.