All current SAEs I'm aware of seem to score very badly on reconstructing the original model's activations.
If you insert a current SOTA SAE into a language model's residual stream, model performance on next-token prediction will usually degrade to what a model trained with less than a tenth, or even a hundredth, of the original model's compute would get. (This is based on extrapolating with Chinchilla scaling curves at compute-optimal settings.) And that's for inserting one SAE at one layer. If you want to study circuits of SAE features, you'll have to insert SAEs at multiple layers at the same time, potentially degrading performance even further.
I think many people outside of interp don't realize this. Part of the reason they don’t realize it might be that almost all SAE papers report loss reconstruction scores on a linear scale, rather than on a log scale or an LM scaling curve. Going from 1.5 CE loss to 2.0 CE loss is a lot worse than going from 4.5 CE to 5.0 CE. Under the hypothesis that the SAE is capturing some of the model's 'features' and failing to capture others, capturing only 50% or 10% of the features might still only drop the CE loss by a small fraction of a unit.
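To make the effective-compute framing concrete, here is a minimal sketch of the conversion I have in mind, assuming a Chinchilla-style compute-optimal fit $L(C) = E + aC^{-b}$. The constants and the base model's compute below are made-up placeholders rather than fitted values, so only the qualitative shape matters:

# Minimal sketch: turn the CE-loss hit from splicing in an SAE into an
# "effective compute" multiplier, assuming a compute-optimal fit
# L(C) = E + a * C^(-b). E, a, b and C_model are made-up placeholders.
E, a, b = 1.7, 1000.0, 0.15

def loss_at_compute(C):
    return E + a * C ** (-b)

def compute_at_loss(L):
    # Invert the fit: the training compute whose compute-optimal loss is L.
    return (a / (L - E)) ** (1.0 / b)

C_model = 1e21                       # pretraining compute of the base model
ce_clean = loss_at_compute(C_model)  # CE loss without the SAE spliced in
for ce_hit in (0.05, 0.2, 0.5):      # CE increases from inserting the SAE
    ratio = compute_at_loss(ce_clean + ce_hit) / C_model
    print(f"+{ce_hit:.2f} CE -> effective compute ~{ratio:.1%} of the original")

The exact numbers here are meaningless; the point is that the mapping from CE-loss differences to effective compute is strongly nonlinear, so small-looking CE hits can correspond to large effective-compute losses.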
So, if someone is jus...
Basically agree - I'm generally a strong supporter of looking at the loss drop in terms of effective compute. Loss recovered using a zero-ablation baseline is really quite wonky and gives misleadingly big numbers.
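For reference, the metric I mean is usually defined as something like

$$\text{loss recovered} = \frac{\mathrm{CE}_{\text{zero-ablation}} - \mathrm{CE}_{\text{SAE}}}{\mathrm{CE}_{\text{zero-ablation}} - \mathrm{CE}_{\text{clean}}},$$

and because zero-ablating an activation is catastrophic for the model, the denominator is huge, so even a sizeable CE hit from the SAE can still show up as a very high percentage of loss recovered.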
I also agree that reconstruction is not the only axis of SAE quality we care about. I propose explainability as the other axis - whether we can make necessary and sufficient explanations for when individual latents activate. Progress then looks like pushing this Pareto frontier.
PSA: The conserved quantities associated with symmetries of neural network loss landscapes seem mostly boring.
If you’re like me, then after you heard that neural network loss landscapes have continuous symmetries, you thought: “Noether’s theorem says every continuous symmetry of the action corresponds to a conserved quantity, like how energy and momentum conservation are implied by translation symmetry and angular momentum conservation is implied by rotation symmetry. Similarly, if loss functions of neural networks can have continuous symmetries, these ought to be associated with quantities that stay conserved under gradient descent[1]!”
This is true. But these conserved quantities don’t seem to be insightful the way energy and momentum in physics are. They basically turn out to just be a sort of coordinate basis for the directions along which the loss is flat.
If our network has a symmetry such that there is an abstract coordinate along which we can vary the parameters without changing the loss, then the gradient with respect to that coordinate will be zero. So, whatever value we started with from random initialisation will be the value we stay at....
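As a concrete illustration (a hedged sketch assuming PyTorch, with a made-up toy task): for the well-known ReLU rescaling symmetry, where scaling a hidden unit's incoming weights by $\alpha$ and its outgoing weights by $1/\alpha$ leaves the function unchanged, the associated conserved quantity is just $\|W_{\text{in}}\|^2 - \|W_{\text{out}}\|^2$. You can check numerically that plain gradient descent with a small step size leaves it almost fixed while the loss changes a lot:

# Q = ||w||^2 - ||v||^2 is exactly conserved under gradient flow for a
# one-hidden-layer ReLU net; under gradient descent with small lr it only
# drifts at O(lr^2) per step.
import torch

torch.manual_seed(0)
x = torch.randn(256, 10)                     # toy inputs
y = torch.randn(256, 1)                      # toy regression targets
w = torch.randn(10, 32, requires_grad=True)  # input -> hidden weights
v = torch.randn(32, 1, requires_grad=True)   # hidden -> output weights

lr = 1e-3
for step in range(2001):
    loss = ((torch.relu(x @ w) @ v - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        v -= lr * v.grad
        w.grad.zero_()
        v.grad.zero_()
        if step % 500 == 0:
            Q = (w ** 2).sum() - (v ** 2).sum()
            print(f"step {step:4d}  loss {loss.item():9.4f}  Q {Q.item():9.4f}")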
I want to point out that there are many interesting symmetries that are non-global or data-dependent. These "non-generic" symmetries can change throughout training. Let me provide a few examples.
ReLU networks. Consider the computation involved in a single layer of a ReLU network:

$$\mathbf{y} = W_{\text{out}}\,\mathrm{ReLU}(W_{\text{in}}\mathbf{x} + \mathbf{b}),$$

or, equivalently,

$$y_k = \sum_i (W_{\text{out}})_{ki}\,\mathrm{ReLU}\!\left(\sum_j (W_{\text{in}})_{ij}\,x_j + b_i\right).$$
(Maybe we're looking at a two-layer network where $\mathbf{x}$ are the inputs and $\mathbf{y}$ are the outputs, or maybe we're at some intermediate layer where these variables represent internal activations before and after a given layer.)
Dead neuron $i$. If one of the biases $b_i$ is so negative that the associated preactivation $\sum_j (W_{\text{in}})_{ij}\,x_j + b_i$ never exceeds zero, then the ReLU will always spit out a zero at that index. This "dead" neuron introduces a new continuous symmetry, where you can set the entries of column $i$ of $W_{\text{out}}$ to arbitrary values without affecting the network's computation (they only ever multiply an activation of zero).
Bypassed neuron $i$. Consider the opposite: if $\sum_j (W_{\text{in}})_{ij}\,x_j + b_i > 0$ for all possible inputs $\mathbf{x}$, then neuron $i$ will always activate, and the ReLU's nonlinearity effectively vanishes at that index. This intro...
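(A quick numerical check of the dead-neuron symmetry, as a sketch in numpy with the notation above: make one neuron's bias very negative so it never fires on the sampled inputs, then verify that the corresponding column of $W_{\text{out}}$ can be replaced by anything without changing the outputs.)

# If neuron i's preactivation never exceeds zero on the inputs we care about,
# the column of W_out reading from it can be set arbitrarily.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))                # sampled inputs
W_in = rng.normal(size=(16, 8))
b = rng.normal(size=16)
b[3] = -100.0                                 # neuron 3 is dead on this data
W_out = rng.normal(size=(4, 16))

h = np.maximum(W_in @ x.T + b[:, None], 0.0)  # hidden activations, (16, 1000)
assert np.all(h[3] == 0.0)                    # neuron 3 never fires

y_original = W_out @ h
W_out_edited = W_out.copy()
W_out_edited[:, 3] = rng.normal(size=4)       # arbitrary new column 3
y_edited = W_out_edited @ h

print(np.max(np.abs(y_original - y_edited)))  # 0.0 -- outputs are identical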
That's what I meant by
If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the value will still be conserved under gradient descent so long as we're inside that region.
...
One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For the degrees of freedom we talk about here, for example, those invariants seem similar to the ones in the ReLU rescaling example above.
Dead neurons are a special case of 3.1.1 (low-dimensional activations) in that paper, bypassed neurons are a special case of 3.2 (synchronised non-linearities). Hidden polytopes are a mix of 3.2.2 (Jacobians spanning a low-dimensional subspace) and 3.1.1, I think. I'm a bit unsure which one because I'm not clear on what weight direction you're imagining varying when you talk about "moving the vertex". Since the first derivative of the function you're approximating doesn't actually change at this point, there are multiple ways you could do this.
Thank you. As a physicist, I wish I had an easy way to find papers which say "I tried this kind of obvious thing you might be considering and nothing interesting happened."
Many people in interpretability currently seem interested in ideas like enumerative safety, where you describe every part of a neural network to ensure all the parts look safe. Those people often also talk about a fundamental trade-off in interpretability between the completeness and precision of an explanation for a neural network's behavior and its description length.
I feel like, at the moment, these sorts of considerations are all premature and beside the point.
I don't understand how GPT-4 can talk. Not in the sense that I don't have an accurate, human-intuitive description of every part of GPT-4 that contributes to it talking well. My confusion is more fundamental than that. I don't understand how GPT-4 can talk the way a 17th-century scholar wouldn't understand how a Toyota Corolla can move. I have no gears-level model for how anything like this could be done at all. I don't want a description of every single plate and cable in a Toyota Corolla, and I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engin...
When doing bottom up interpretability, it's pretty unclear if you can answer questions like "how does GPT-4 talk" without being able to explain arbitrary parts to a high degree of accuracy.
I agree that top down interpretability trying to answer more basic questions seems good. (And generally I think top down interpretability looks more promising than bottom up interpretability at current margins.)
(By interpretability, I mean work aimed at having humans understand the algorithm/approach the model uses to solve tasks. I don't mean literally any work which involves using the internals of the model in some non-basic way.)
I have no gears-level model for how anything like this could be done at all. [...] What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don't have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself.
It's not obvious to me that what you seem to want exists. I think the way LLMs work might not be well described as having key internal gears, or as admitting an at-all-illuminating Python code sketch.
(I'd guess something sorta close to what you seem to be describing, but ultimately disappointing and mostly unilluminating exists. And something tremendously complex but ultimately pretty illuminating if you fully understood it might exist.)
I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular I can imagine that my understanding of how GPT-4 could talk might be satisfied by understanding the principles by which it talks, but without necessarily being able to write a talking machine from scratch. Maybe what I'd be after in terms of what I can build is a talking machine of a certain toyish flavor - a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn't know until our understanding is more complete.
This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.
"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.
They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1]
This is kind of in the genre of a lot of papers and takes that I think used to be around a few years back, which argued that the then stil...
I think we may be close to figuring out a general mathematical framework for circuits in superposition.
I suspect that we can get a proof that roughly shows:
Crucially, the total number of superposed operations we can carry out scales linearly with the network's parameter count, not its neuron count or attention head count. E.g. if each little subnetwork uses $n$ neurons per MLP layer and $d$ dimensions in the residual stream, a big network with $N$ neurons per MLP layer connected to a $D$-dimensional residual stream can implement about $\frac{ND}{nd}$ subnetworks, not just $\frac{N}{n}$.
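Plugging in some arbitrary example numbers just to illustrate the gap between the two scalings (constants and log factors ignored):

# Arbitrary example numbers; constants and log factors ignored.
n, d = 10, 20             # neurons / residual dims used by each small subnetwork
N, D = 10_000, 5_000      # neurons per MLP layer / residual dims of the big network

neuron_limited = N // n                  # ~N/n from naive neuron counting
parameter_limited = (N * D) // (n * d)   # ~ND/(nd), scaling with parameter count

print(neuron_limited)     # 1000
print(parameter_limited)  # 250000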
This would be a generalization of the construction for boolean logic gates in superposi...
Current LLMs are trivially mesa-optimisers under the original definition of that term.
I don't get why people are still debating the question of whether future AIs are going to be mesa-optimisers. Unless I've missed something about the definition of the term, lots of current AI systems are mesa-optimisers. There were mesa-optimisers around before Risks from Learned Optimization in Advanced Machine Learning Systems was even published.
We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
....
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
GPT-4 is capable of making plans to achieve objectives if you prompt it to. It can even write code to find the local optimum of a function, or code to train another neural network, making it a mesa-meta-optimiser. If gradient descent is an optimiser, then GPT-4 certainly is. ...
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with.
Gradient descent, in this sense of the term, is not an optimizer according to Risks from Learned Optimization.
Consider that Risks from Learned Optimization talks a lot about "the base objective" and "the mesa-objective." This only makes sense if the objects being discussed are optimization algorithms together with specific, fixed choices of objective function.
"Gradient descent" in the most general sense is -- as you note -- not this sort of thing. Therefore, gradient descent in that general sense is not the kind of thing that Risks from Learned Optimization is about.
Gradient descent in this general sense is a "two-argument function," $\mathrm{GD}(f, \theta)$, where $\theta$ is the thing to be optimized and $f$ is the objective function. The objects of interest in Risks from Learned Optimization are curried single-arg...
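To illustrate the distinction with a toy sketch (hypothetical code, not anything from the paper): gradient descent as a generic two-argument procedure has no objective of its own, whereas the curried object with a fixed objective baked in is the kind of thing the base-objective/mesa-objective language applies to.

from functools import partial

def gradient_descent(objective_grad, theta, lr=0.1, steps=100):
    # Generic two-argument optimiser: it is handed an objective (here via its
    # gradient) along with a starting point; it has no objective of its own.
    for _ in range(steps):
        theta = theta - lr * objective_grad(theta)
    return theta

# A curried, single-argument optimiser with the objective (t - 3)^2 baked in.
minimise_fixed_objective = partial(gradient_descent, lambda t: 2 * (t - 3.0))

print(gradient_descent(lambda t: 2 * (t + 1.0), theta=0.0))  # -> about -1.0
print(minimise_fixed_objective(theta=0.0))                   # -> about 3.0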
Does the Solomonoff Prior Double-Count Simplicity?
Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string $x$. The prior probability for any such string is given by:

$$P(x) = \sum_{p\,:\,U(p)=x} 2^{-|p|},$$

where $p$ ranges over programs for a prefix-free Universal Turing Machine $U$, and $|p|$ is the length of $p$ in bits.
Observation: If we have a simple one kilobit program $p$ that outputs prediction $x$, we can construct nearly $2^{1000}$ different two kilobit programs that also output $x$ by appending arbitrary "dead code" that never executes.
For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program p]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix-free requirement.
If we compare against another two kilobi...
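To put rough numbers on the observation (a back-of-the-envelope sketch that ignores the bits needed for the DEADCODE wrapper itself):

L_short = 1000    # bits in the simple program
L_padded = 2000   # bits in each dead-code-padded variant

log2_direct = -L_short                            # the program's own weight: 2^-1000
log2_multiplicity = L_padded - L_short            # ~2^1000 distinct paddings
log2_padded_total = log2_multiplicity - L_padded  # 2^1000 * 2^-2000 = 2^-1000

print(log2_direct, log2_padded_total)  # -1000 -1000

Up to the wrapper overhead, the two-kilobit dead-code copies jointly contribute about as much prior mass as the one-kilobit program's own term, which is the multiplicity effect the question is pointing at.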
Has anyone thought about how the idea of natural latents may be used to help formalise QACI?
The simple core insight of QACI according to me is something like: A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals. Even if this formal process costs galactic amounts of compute and can never actually be run, not even by the AGI itself.
This allows for some funny value specification strategies we might not usually think about. For example, we could try using some camera recordings of the present day, a for loop, and a code snippet implementing something like Solomonoff induction to formally specify the idea of Earth sitting around in a time loop until it has worked out its CEV.
It doesn't matter that the AGI can't compute that. So long as it can reason about what the result of the computation would be without running it, this suffices as a pointer to our CEV. Even if the AGI doesn't manage to infer the exact result of the process, that's fine so long as it can infer some bits of information about the result. This just ends up giving the AGI some moral uncerta...
I do think natural latents could have a significant role to play somehow in QACI-like setups, but it doesn't seem like they let you avoid formalizing, at least in the way you're talking about. It seems more interesting in terms of avoiding specifying a universal prior over possible worlds, if we can instead specify a somewhat less universal prior that bakes in assumptions about our worlds' known causal structure. it might help with getting a robust pointer to the start of the time snippet. I don't see how it helps avoiding specifying "looping", or "time snippet", etc. natural latents seem to me to be primarily about the causal structure of our universe, and it's unclear what they even mean otherwise. it seems like our ability to talk about this concept is made up of a bunch of natural latents, and some of them are kind of messy and underspecified by the phrase, mainly relating to what the heck is a physics.
To me kinda the whole point of QACI is that it tries to actually be fully formalized. Informal definitions seem very much not robust to when superintelligences think about them; fully formalized definitions are the only thing I know of that keep meaning the same thing regardless of what kind of AI looks at it or with what kind of ontology.
I don't really get the whole natural latents ontology at all, and mostly expect it to be too weak for us to be able to get reflectively stable goal-content integrity even as the AI becomes vastly superintelligent. If definitions are informal, that feels to me like degrees of freedom in which an ASI can just pick whichever values make its job easiest.
Perhaps something like this allows us to use current, non-vastly-superintelligent AIs to help design a formalized version of QACI or ESP which itself is robust enough to be passed to superintelligent optimizers; but my response to this is usually "have you tried first formalizing CEV/QACI/ESP by hand?" because it feels like we've barely tried and like reasonable progress can be made on it that way.
Perhaps there are some cleverer schemes where the superintelligent optimizer is pointed at the weaker cur...
Two shovel-ready theory projects in interpretability.
Most scientific work isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.
Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.
Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop duri...