Transformers Represent Belief State Geometry in their Residual Stream
Produced while an affiliate at PIBBSS[1]. The work was initially funded by a Lightspeed Grant and then continued at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during part of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.

Update May 24, 2024: See our manuscript based on this work.

Introduction

What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because

* We have a formalism that relates training data to internal structures in LLMs.
* Conceptually, our results mean that LLMs synchronize to their internal world model as they move through the context window.
* The computation associated with synchronization can be formalized with a framework called Computational Mechanics. In the parlance of Computational Mechanics, we say that LLMs represent the Mixed-State Presentation of the data-generating process.
* The structure of synchronization is, in general, richer than the world model itself. In this sense, LLMs learn more than a world model.
* We have increased hope that Computational Mechanics can be leveraged for interpretability and AI Safety more generally.
* There's just something inherently cool about making a non-trivial prediction - in this case that the transformer will represent a specific fractal structure - and then verifying that the prediction is true.

Concretely, we are able to use Computational Mechanics to make an a priori and specific theoretical prediction about the geometry of residual stream activations (below on the left), and then show that this prediction holds true (below on the right).
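To make "belief updating over hidden states of the data-generating process" concrete, here is a minimal sketch in NumPy. It uses a toy 3-state, 2-symbol hidden Markov model with made-up transition matrices (not the actual process studied in this post) and runs the standard Bayesian filter: after each observed token, the belief over hidden states is updated and renormalized. The set of reachable beliefs, viewed as points in the probability simplex, is exactly the Mixed-State Presentation referred to above.

```python
import numpy as np

# Toy HMM with 3 hidden states and 2 output symbols (illustrative parameters only,
# not the data-generating process used in the post).
# T[s][i, j] = P(emit symbol s, move to state j | current state i),
# so sum_s T[s] must be a row-stochastic matrix.
T = np.array([
    [[0.4, 0.1, 0.0],
     [0.0, 0.3, 0.1],
     [0.1, 0.0, 0.3]],   # symbol 0
    [[0.3, 0.2, 0.0],
     [0.1, 0.4, 0.1],
     [0.2, 0.1, 0.3]],   # symbol 1
])
assert np.allclose(T.sum(axis=0).sum(axis=1), 1.0)

def update_belief(belief, symbol):
    """One Bayesian update of the belief over hidden states: eta' ∝ eta @ T[symbol]."""
    unnormalized = belief @ T[symbol]
    return unnormalized / unnormalized.sum()

def belief_trajectory(sequence, prior):
    """Belief state after each prefix of the observed token sequence."""
    beliefs, b = [], prior
    for s in sequence:
        b = update_belief(b, s)
        beliefs.append(b)
    return np.array(beliefs)

# Start from the stationary distribution of the overall transition matrix.
M = T.sum(axis=0)
eigvals, eigvecs = np.linalg.eig(M.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

# Each distinct reachable belief is a state of the Mixed-State Presentation;
# plotting these points in the 2-simplex gives the geometry the post predicts
# should appear (linearly embedded) in the residual stream.
print(belief_trajectory([0, 1, 1, 0, 1], stationary))
```

This is only a sketch of the update rule under assumed parameters; the post's actual prediction concerns the geometry traced out by these belief states for the specific process the transformer is trained on.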