This is a special post for quick takes by Lucius Bushnaq. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

All current SAEs I'm aware of seem to score very badly on reconstructing the original model's activations.

If you insert a current SOTA SAE into a language model's residual stream, model performance on next token prediction will usually degrade down to what a model trained with less than a tenth or a hundredth of the original model's compute would get. (This is based on extrapolating with Chinchilla scaling curves at optimal compute). And that's for inserting one SAE at one layer. If you want to study circuits of SAE features, you'll have to insert SAEs in multiple layers at the same time, potentially further degrading performance.

I think many people outside of interp don't realize this. Part of the reason they don’t realize it might be that almost all SAE papers report loss reconstruction scores on a linear scale, rather than on a log scale or an LM scaling curve. Going from 1.5 CE loss to 2.0 CE loss is a lot worse than going from 4.5 CE to 5.0 CE. Under the hypothesis that the SAE is capturing some of the model's 'features' and failing to capture others, capturing only 50% or 10% of the features might still only drop the CE loss by a small fraction of a unit.
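To make the log-scale point concrete, here is a rough sketch that converts a CE loss into "effective compute" along a compute-optimal scaling curve. It assumes the parametric fit L(N, D) = E + A/N^α + B/D^β with the constants reported in the Chinchilla paper (Hoffmann et al. 2022) and the usual C ≈ 6ND approximation. Those constants belong to that paper's models and tokenizer, so the numbers are illustrative only, and the function names are made up for this example. Because this particular fit has an irreducible loss of about 1.69, the example compares 2.2 → 2.7 against 4.5 → 5.0 rather than starting from 1.5.

```python
# Assumed constants: the parametric fit from the Chinchilla paper.
# They are not fitted to any model discussed in this thread.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def compute_optimal_loss(flops):
    """Loss of a compute-optimal model at a given FLOP budget, using C ~ 6*N*D."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_params = g * (flops / 6.0) ** (BETA / (ALPHA + BETA))
    n_tokens = (flops / 6.0) / n_params
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def effective_compute(loss):
    """Invert compute_optimal_loss: FLOPs needed to reach `loss` on this curve."""
    if loss <= E:
        raise ValueError("loss is at or below the fit's irreducible loss")
    # On the compute-optimal frontier the reducible loss is k * C**(-gamma).
    gamma = ALPHA * BETA / (ALPHA + BETA)
    k = compute_optimal_loss(1.0) - E
    return (k / (loss - E)) ** (1.0 / gamma)


def compute_degradation_factor(loss_before, loss_after):
    """How many times less compute the degraded loss corresponds to."""
    return effective_compute(loss_before) / effective_compute(loss_after)


low_loss_factor = compute_degradation_factor(2.2, 2.7)
high_loss_factor = compute_degradation_factor(4.5, 5.0)
print(f"2.2 -> 2.7 CE: ~{low_loss_factor:.0f}x less effective compute")
print(f"4.5 -> 5.0 CE: ~{high_loss_factor:.1f}x less effective compute")
```

Under these assumptions the same 0.5 increase in CE loss costs far more effective compute near the low-loss end of the curve than at the high-loss end, which is the asymmetry a linear loss scale hides.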

So, if someone is jus... (read more)

[-]leogao106

Basically agree - I'm generally a strong supporter of looking at the loss drop in terms of effective compute. Loss recovered using a zero-ablation baseline is really quite wonky and gives misleadingly big numbers.

I also agree that reconstruction is not the only axis of SAE quality we care about. I propose explainability as the other axis - whether we can make necessary and sufficient explanations for when individual latents activate. Progress then looks like pushing this Pareto frontier.

4Neel Nanda
This seems true to me, though finding the right scaling curve for models is typically quite hard so the conversion to effective compute is difficult. I typically use CE loss change, not loss recovered. I think we just don't know how to evaluate SAE quality. My personal guess is that SAEs can be a useful interpretability tool despite making a big difference in effective compute, and we should think more in terms of how useful they are for downstream tasks. But I agree this is a real phenomenon, that is easy to overlook, and is bad.
4Alexander Gietelink Oldenziel
As a complete noob in all things mechinterp, can somebody explain how this is not in conflict with SAE enjoyers saying they get reconstruction loss in the high 90s or even 100%? I understand the logscale argument that Lucius is making, but it still seems surprising. Is this really what's going on, or are they talking about different things here?
8ryan_greenblatt
The key question is 90% recovered relative to what. If you recover 90% of the loss relative to a 0 ablation baseline (that ablates the entire residual stream midway through the model!), that isn't clearly that much. E.g., if full zero ablation is 13 CE loss (seems plausible) and the SAE gets you to 3 CE while the original model was at 2 CE, this is 90%, but you have also massively degraded performance in terms of effective training compute. IDK about literal 100%.
2Lucius Bushnaq
The metric you mention here is probably 'loss recovered'. For a residual stream insertion, it is 1 - (CE loss with SAE - CE loss of original model) / (CE loss if the entire residual stream is ablated - CE loss of original model). See e.g. equation 5 here. So, it's a linear scale, and they're comparing the CE loss increase from inserting the SAE to the CE loss increase from just destroying the model and outputting a ≈ uniform distribution over tokens. The latter is a very large CE loss increase, so the denominator is really big. Thus, scoring over 90% is pretty easy.
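As a quick check of the 'loss recovered' formula, here is a minimal sketch using the illustrative numbers from ryan_greenblatt's comment above (zero ablation at 13 CE, SAE at 3 CE, original model at 2 CE). The function and argument names are made up for this example.

```python
def loss_recovered(ce_with_sae, ce_original, ce_zero_ablation):
    """Fraction of the zero-ablation loss gap that the SAE-spliced model recovers."""
    return 1.0 - (ce_with_sae - ce_original) / (ce_zero_ablation - ce_original)


# Illustrative numbers from the thread, not measurements.
score = loss_recovered(ce_with_sae=3.0, ce_original=2.0, ce_zero_ablation=13.0)
print(f"loss recovered: {score:.1%}")
```

The score comes out at roughly 91% even though the spliced model's CE loss is a full unit worse than the original's, because the denominator is dominated by the very large zero-ablation loss.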
2Sodium
Have people done evals for a model with/without an SAE inserted? Seems like even just looking at drops in MMLU performance by category could be non-trivially informative. 
2Lucius Bushnaq
I've seen a little bit of this, but nowhere near as much as I think the topic merits. I agree that systematic studies on where and how the reconstruction errors make their effects known might be quite informative. Basically, whenever people train SAEs, or use some other approximate model decomposition that degrades performance, I think they should ideally spend some time after just playing with the degraded model and talking to it. Figure out in what ways it is worse.
1Sodium
Hmmm ok maybe I’ll take a look at this :)
1keith_wynroe
What are your thoughts on KL-div after the unembed softmax as a metric?
3Lucius Bushnaq
On its own, this'd be another metric that doesn't track the right scale as models become more powerful. The same KL-div in GPT-2 and GPT-4 probably corresponds to the destruction of far more of the internal structure in the latter than the former.   Destroy 95% of GPT-2's circuits, and the resulting output distribution may look quite different. Destroy 95% of GPT-4's circuits, and the resulting output distribution may not be all that different, since 5% of the circuits in GPT-4 might still be enough to get a lot of the most common token prediction cases roughly right.
2Neel Nanda
I don't see important differences between that and ce loss delta in the context Lucius is describing
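For readers unfamiliar with the two metrics compared in this thread, here is a small sketch that computes both the KL divergence after the softmax and the CE loss delta for a single token position. The logits are invented toy values and the helper names are hypothetical. In practice both quantities would be averaged over many positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) between the output distributions implied by two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ce_loss(logits, target_index):
    """Cross-entropy of the true next token under the model's distribution."""
    return -math.log(softmax(logits)[target_index])


# Toy logits for one position over a 4-token vocabulary (made up).
original_logits = [4.0, 1.0, 0.5, -1.0]
spliced_logits = [2.5, 1.5, 0.5, -0.5]  # the same model with an SAE inserted
target = 0

kl = kl_divergence(original_logits, spliced_logits)
ce_delta = ce_loss(spliced_logits, target) - ce_loss(original_logits, target)
print(f"KL(original || spliced) = {kl:.4f}, CE loss delta = {ce_delta:.4f}")
```

Both numbers are measured on the output distribution, so neither one directly says how much of the model's internal structure was lost, which is the concern raised above.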

PSA: The conserved quantities associated with symmetries of neural network loss landscapes seem mostly boring.

If you’re like me, then after you heard that neural network loss landscapes have continuous symmetries, you thought: “Noether’s theorem says every continuous symmetry of the action corresponds to a conserved quantity, like how energy and momentum conservation are implied by translation symmetry and angular momentum conservation is implied by rotation symmetry. Similarly, if loss functions of neural networks can have continuous symmetries, these ought to be associated with quantities that stay conserved under gradient descent[1]!” 

This is true. But these conserved quantities don’t seem to be insightful the way energy and momentum in physics are. They basically turn out to just be a sort of coordinate basis for the directions along which the loss is flat. 

If our network has a symmetry such that there is an abstract coordinate  along which we can vary the parameters without changing the loss, then the gradient with respect to that coordinate will be zero. So, whatever  value we started with from random initialisation will be the value we stay at.... (read more)

I want to point out that there are many interesting symmetries that are non-global or data-dependent. These "non-generic" symmetries can change throughout training. Let me provide a few examples. 

ReLU networks. Consider the computation involved in a single layer of a ReLU network:

or, equivalently,

(Maybe we're looking at a two-layer network where  are the inputs and  are the outputs, or maybe we're at some intermediate layer where these variables represent internal activations before and after a given layer.)

Dead neuron . If one of the biases  is always larger than the associated preactivation , then the ReLU will always spit out a zero at that index. This "dead" neuron introduces a new continuous symmetry, where you can set the entries of column  of  to an arbitrary value, without affecting the network's computation (). 

Bypassed neuron . Consider the opposite: if  for all possible inputs , then neuron  will always activate, and the ReLU's nonlinearity effectively vanishes at that index. This intro... (read more)

That's what I meant by

If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the  value will still be conserved under gradient descent so long as we're inside that region.

...

One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For e.g. the degrees of freedom we talk about here, those invariants seem similar to the ones in the ReLU rescaling example above.

Dead neurons are a special case of 3.1.1 (low-dimensional activations) in that paper, bypassed neurons are a special case of 3.2 (synchronised non-linearities). Hidden polytopes are a mix of 3.2.2 (Jacobians spanning a low-dimensional subspace) and 3.1.1 I think. I'm a bit unsure which one because I'm not clear on what weight direction you're imagining varying when you talk about "moving the vertex". Since the first derivative of the function you're approximating doesn't actually change at this point, there's multiple ways you could do this.

[-]Tahp1713

Thank you. As a physicist, I wish I had an easy way to find papers which say "I tried this kind of obvious thing you might be considering and nothing interesting happened."

8Dmitry Vaintrob
Yeah I was somewhat annoyed that early SLT made such a big deal out of them. These are boring, spurious things, and another useful intuition is a rough idea (not always true, but more often than not) that "no information that requires your activation to be a ReLU and fails to work well with the approximation theorem is useful for interp". I recently did a deep dive into physics and SLT with PIBBSS colleague Lauren Greenspan, that I'm going to write about at some point this month. My understanding there is that there is a plausibly useful type of symmetry that you can try to think about in a Noether-esque way: this is the symmetry of a model before being initialized or seeing any data. Namely, in the standard physics point of view, you view a choice of weights as a field (so whatever processes that happen are integrated over the prior of weight initializations in a path integral fashion) and you view input-output examples as experimental data (so the stuff that goes into the collider -- the behavior on a new datapoint can be thought of as a sort of the "output" of the scattering experiment). The point is that the substrate on which physicists see symmetries happens before the symmetry breaking inherent in "performing the experiment", i.e., training on any inputs or choosing any weights. Here the standard initialization assumption has orthogonal O(d) symmetry at every layer, for d the width (Edited to clarify: here if you have some inputs x_1, .., x_n then the probability of seeing activations y_1, .., y_n at layer d at initialization is equal to the probability of seeing activations R(y_1), .., R(y_n) for R a rotation matrix. This means that the "vacuum" prior on tuples y_1, .., y_n -- which later gets "symmetry broken" via Bayesian updating or SGD -- will be invariant with respect to hitting each layer of activations with a rotation matrix R). If the width is big, this is a very big symmetry group which is useful for simplifying the analysis (this is implicitly us
6Razied
More insightful than what is conserved under the scaling symmetry of ReLU networks is what is not conserved: the gradient. Scaling w1 by α scales ∂E/∂w1 by 1/α and ∂E/∂w2 by α, which means that we can obtain arbitrarily large gradient norms by simply choosing small enough α. And in general bad initializations can induce large imbalances in how quickly the parameters on either side of the neuron learn.  Some time ago I tried training some networks while setting these symmetries to the values that would minimize the total gradient norm, effectively trying to distribute the gradient norm as equally as possible throughout the network. This significantly accelerated learning, and allowed extremely deep (100+ layers) networks to be trained without residual layers. This isn't that useful for modern networks because batchnorm/layernorm seems to effectively do the same thing, and isn't dependent on having ReLU as the activation function.  Minor detail, but this is false in practice because we are doing gradient descent with a non-zero learning rate, so there will be some diffusion between different hyperbolas in weight space as we take gradient steps of finite size.
4Lucius Bushnaq
See footnote 1.
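The ReLU rescaling symmetry discussed in this thread can be checked numerically. The sketch below uses a hypothetical one-neuron network f(x) = w2 · relu(w1 · x) with squared error on made-up data. It verifies that rescaling (w1, w2) → (α·w1, w2/α) with α > 0 leaves the loss unchanged, that the two gradients scale by 1/α and α as Razied describes, and that w1·∂E/∂w1 = w2·∂E/∂w2, which is the identity that makes w1² − w2² constant under gradient flow (but only approximately constant with finite step sizes).

```python
def loss_and_grads(w1, w2, data):
    """Squared-error loss and gradients for f(x) = w2 * relu(w1 * x)."""
    loss = grad_w1 = grad_w2 = 0.0
    for x, y in data:
        pre = w1 * x
        hidden = max(pre, 0.0)
        err = w2 * hidden - y
        loss += 0.5 * err**2
        grad_w1 += err * w2 * (x if pre > 0 else 0.0)
        grad_w2 += err * hidden
    return loss, grad_w1, grad_w2


# Made-up toy data and weights, purely for illustration.
data = [(1.0, 2.0), (2.0, 3.0), (-1.0, 0.5)]
w1, w2, alpha = 0.7, 1.3, 4.0

loss, g1, g2 = loss_and_grads(w1, w2, data)
loss_s, g1_s, g2_s = loss_and_grads(alpha * w1, w2 / alpha, data)

print("loss unchanged:      ", abs(loss - loss_s) < 1e-9)
print("dE/dw1 scaled by 1/a:", abs(g1_s - g1 / alpha) < 1e-9)
print("dE/dw2 scaled by a:  ", abs(g2_s - g2 * alpha) < 1e-9)
print("w1*dE/dw1 == w2*dE/dw2:", abs(w1 * g1 - w2 * g2) < 1e-9)
```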

Many people in interpretability currently seem interested in ideas like enumerative safety, where you describe every part of a neural network to ensure all the parts look safe. Those people often also talk about a fundamental trade-off in interpretability between the completeness and precision of an explanation for a neural network's behavior and its description length. 

I feel like, at the moment, these sorts of considerations are all premature and beside the point.  

I don't understand how GPT-4 can talk. Not in the sense that I don't have an accurate, human-intuitive description of every part of GPT-4 that contributes to it talking well. My confusion is more fundamental than that. I don't understand how GPT-4 can talk the way a 17th-century scholar wouldn't understand how a Toyota Corolla can move. I have no gears-level model for how anything like this could be done at all. I don't want a description of every single plate and cable in a Toyota Corolla, and I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field. 

What I want right now is a basic understanding of combustion engin... (read more)

When doing bottom up interpretability, it's pretty unclear if you can answer questions like "how does GPT-4 talk" without being able to explain arbitrary parts to a high degree of accuracy.

I agree that top down interpretability trying to answer more basic questions seems good. (And generally I think top down interpretability looks more promising than bottom up interpretability at current margins.)

(By interpretability, I mean work aimed at having humans understand the algorithm/approach the model uses to solve tasks. I don't mean literally any work which involves using the internals of the model in some non-basic way.)

I have no gears-level model for how anything like this could be done at all. [...] What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don't have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself.

It's not obvious to me that what you seem to want exists. I think the way LLMs work might not be well described as having key internal gears or having an at-all illuminating python code sketch.

(I'd guess something sorta close to what you seem to be describing, but ultimately disappointing and mostly unilluminating exists. And something tremendously complex but ultimately pretty illuminating if you fully understood it might exist.)

3eye96458
What motivates your believing that?

I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular I can imagine that my understanding of how GPT-4 could talk might be satisfied by understanding the principles by which it talks, but without necessarily being able to write a talking machine from scratch. Maybe what I'd be after in terms of what I can build is a talking machine of a certain toyish flavor - a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn't know until our understanding is more complete.

5RogerDearnaley
Interesting question. I'd suggest starting by doing interpretability on some of the TinyStories models and corpus: they have models with as few as 1–2 layers, 64-or-more dimensional embeddings, and only millions of parameters that can talk (childish) English. That sounds like the sort of thing that might actually be enumerable, with enough work. I think trying to figure that out might be a great testing ground for current ideas in interpretability: large enough to not be a toy model, but small enough to hopefully be tractable.
3StefanHex
The tiny story status seems quite simple, in the sense that I can see how you could provide TinyStories levels of loss by following simple rules plus a bunch of memorization. Empirically, one of the best models in the TinyStories paper is a super wide 1L transformer, which basically is bigrams, trigrams, and slightly more complicated variants [see Buck's post] but nothing that requires a step of reasoning. I am actually quite uncertain where the significant gap between TinyStories, GPT-2 and GPT-4 is. Maybe I could fully understand TinyStories-1L if I tried, would this tell us about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.
3jow
Is that TinyStories model a super-wide attention-only transformer (the topic of the mechanistic interp work and Buck’s post you cite)? I tried to figure it out briefly and couldn’t tell, but I bet it isn’t, and instead has extra stuff like an MLP block. Regardless, in my view it would be a big advance to really understand how the TinyStories models work. Maybe they are “a bunch of heuristics” but maybe that’s all GPT-4, and our own minds, are as well…
1StefanHex
That model has an Attention and MLP block (GPT2-style model with 1 layer but a bit wider, 21M params). I changed my mind over the course of this morning. The TinyStories models' language isn't that bad, and I think it'd be a decent research project to try to fully understand one of these. I've been playing around with the models this morning, quotes from the 1-layer model: and This feels like the kind of inconsistency I expect from a model that has only one layer. It can recall that the story was about flying and stuff, and the names, but it feels a bit like the model doesn't remember what it said a paragraph before. 2-layer model: and I think if we can fully understand (in the Python code sense, probably with a bunch of lookup tables) how these models work this will give us some insight into where we're at with interpretability. Do the explanations feel sufficiently compressed? Does it feel like there's a simpler explanation than the code & tables we've written? Edit: Specifically I'm thinking of
* Train SAEs on all layers
* Use this for Attention QK circuits (and transform OV circuit into SAE basis, or Transcoder basis)
* Use Transcoders for MLPs (Transcoders vs SAEs are somewhat redundant / different approaches, figure out how to connect everything together)
2RogerDearnaley
Yup: the 1L model samples are full of non-sequiturs, to the level I can't imagine a human child telling a story that badly; whereas the first 2L model example has maybe one non-sequitur/plot jump (the way the story ignores the content of bird's first line of dialog), which the rest of the story then works into it so it ends up almost making sense, in retrospect (except it would have made better sense if the bear had said that line). The second example has a few non-sequiturs, but they're again not glaring and continuous the way the 1L output is. (As a parent) I can imagine a rather small human child telling a story with about the 2L level of plot inconsistencies.
2RogerDearnaley
From rereading the Tiny Stories paper, the 1L model did a really bad job of maintaining the internal consistency of the story and figuring out and allowing for the logical consequences of events, but otherwise did a passably good job of speaking coherent childish English. So the choice on transformer block count would depend on how interested you are in learning how to speak English that is coherent as well as grammatical. Personally I'd probably want to look at something in the 3–4-layer range, so it has an input layer, an output layer, and at least one middle layer, and might actually contain some small circuits. I would LOVE to have an automated way of converting a Tiny Stories-size transformer to some form of declarative language spaghetti code. It would probably help to start with a heavily-quantized version. For example, a model trained using the techniques of the recent paper on building AI using trinary logic (so roughly a 1.6-bit quantization, and eliminating matrix multiplication entirely) might be a good place to start, combined with the sort of techniques the model-pruning folks have been working on for which model-internal interactions are important on the training set and which are just noise and can be discarded. I strongly suspect that every transformer model is just a vast pile of heuristics. In certain cases, if trained on a situation that genuinely is simple and has a specific algorithm to solve it runnable during a model forward-pass (like modular arithmetic, for example), and with enough data to grok it, then the resulting heuristic may actually be an elegant True Name algorithm for the problem. Otherwise, it's just going to be a pile of heuristics that SGD found and tuned. Fortunately SGD (for reasons that singular learning theory illuminates) has a simplicity bias that gives a prior that acts like Occam's Razor or a Kolmogorov Complexity prior, so tends to prefer algorithms that generalize well (especially as the amount of data tends to inf
3Bogdan Ionut Cirstea
How would you operationalize this in ML terms? E.g. how much loss in performance would you consider acceptable, on how wide a distribution of e.g. GPT-4's capabilities, how many lines of python code, etc.? Would you consider acceptable existing rough theoretical explanations, e.g. An Information-Theoretic Analysis of In-Context Learning? (I suspect not, because no 'sketch of python code' feasibility). 
1Bogdan Ionut Cirstea
(I'll note that by default I'm highly skeptical of any current-day-human producing anything like a comprehensible, not-extremely-long 'sketch of Python code' of GPT-4 in a reasonable amount of time. For comparison, how hopeful would you be of producing the same for a smart human's brain? And on some dimensions - e.g. knowledge - GPT-4 is vastly superhuman.)
2RogerDearnaley
I think OP just wanted some declarative code (I don't think Python is the ideal choice of language, but basically anything that's not a Turing tarpit is fine) that could speak fairly coherent English. I suspect if you had a functional transformer decompiler the results of applying it to a Tiny Stories-size model are going to be tens to hundreds of megabytes of spaghetti, so understanding that in detail is going to be a huge slog, but on the other hand, this is an actual operationalization of the Chinese Room argument (or in this case, English Room)! I agree it would be fascinating, if we can get a significant fraction of the model's perplexity score. If it is, as people seem to suspect, mostly or entirely a pile of spaghetti, understanding even a representative (frequency-of-importance biased) statistical sample of it (say, enough for generating a few specific sentences) would still be fascinating.
1Jason Gross
This is the wrong 'length'. The right version of brute-force length is not "every weight and bias in the network" but "the program trace of running the network on every datapoint in pretrain". Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability. Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.

This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.

"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.

They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1] 

This is kind of in the genre of a lot of papers and takes I think used to be around a few years back, which argued that the then stil... (read more)

8Daniel Murfet
Thanks Lucius. This agrees with my take on that paper and I'm glad to have this detailed comment to refer people to in the future.
5Alexander Gietelink Oldenziel
It's still wild to me that highly cited papers in this space can make such elementary errors. 
3Archimedes
Do you have any papers or other resources you'd recommend that cover the latest understanding? What is the SOTA for Bayesian NNs?

I think we may be close to figuring out a general mathematical framework for circuits in superposition. 

I suspect that we can get a proof that roughly shows:

  1. If we have a set of  different transformers, with parameter counts  implementing e.g. solutions to  different tasks
  2. And those transformers are robust to size  noise vectors being applied to the activations at their hidden layers
  3. Then we can make a single transformer with  total parameters that can do all  tasks, provided any given input only asks for  tasks to be carried out

Crucially, the total number of superposed operations we can carry out scales linearly with the network's parameter count, not its neuron count or attention head count. E.g. if each little subnetwork uses  neurons per MLP layer and  dimensions in the residual stream,  a big network with  neurons per MLP connected to a -dimensional residual stream can implement about  subnetworks, not just 

This would be a generalization of the construction for boolean logic gates in superposi... (read more)

2Lucius Bushnaq
Spotted just now.  At a glance, this still seems to be about boolean computation though. So I think I should still write up the construction I have in mind. Status on the proof: I think it basically checks out for residual MLPs. Hoping to get an early draft of that done today. This will still be pretty hacky in places, and definitely not well presented. Depending on how much time I end up having and how many people collaborate with me, we might finish a writeup for transformers in the next two weeks.

Current LLMs are trivially mesa-optimisers under the original definition of that term.

I don't get why people are still debating the question of whether future AIs are going to be mesa-optimisers. Unless I've missed something about the definition of the term, lots of current AI systems are mesa-optimisers. There were mesa-optimisers around before Risks from Learned Optimization in Advanced Machine Learning Systems was even published.

We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
....
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

GPT-4 is capable of making plans to achieve objectives if you prompt it to. It can even write code to find the local optimum of a function, or code to train another neural network, making it a mesa-meta-optimiser. If gradient descent is an optimiser, then GPT-4 certainly is.... (read more)

8Bogdan Ionut Cirstea
I suspect a lot of the disagreement might be about whether LLMs are something like consistent / context-independent optimizers of e.g. some utility function (they seem very unlikely to), not whether they're capable of optimization in various (e.g. prompt-dependent, problem-dependent) contexts. 
6Bogdan Ionut Cirstea
The top comment also seems to be conflating whether a model is capable of (e.g. sometimes, in some contexts) mesaoptimizing and whether it is (consistently) mesaoptimizing. I interpret the quoted original definition as being about the second, which LLMs probably aren't, though they're capable of the first. This seems like the kind of ontological confusion that the Simulators post discusses at length.
3Lucius Bushnaq
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser. Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with. If you don’t pass any valid function, it doesn’t optimise anything. GPT-4, taken by itself, without a prompt, will optimise pretty much whatever you prompt it to optimise. If you don’t prompt it to optimise something, it usually doesn’t optimise anything. I guess you could say GPT-4, unlike gradient descent, can do things other than optimise something. But if ever not optimising things excluded you from being an optimiser, humans wouldn’t be considered optimisers either. So it seems to me that the paper just meant what it said in the quote. If you look through a search space to accomplish an objective, you are, at present, an optimiser.

If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.

Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with.

Gradient descent, in this sense of the term, is not an optimizer according to Risks from Learned Optimization.

Consider that Risks from Learned Optimization talks a lot about "the base objective" and "the mesa-objective."  This only makes sense if the objects being discussed are optimization algorithms together with specific, fixed choices of objective function.

"Gradient descent" in the most general sense is -- as you note -- not this sort of thing.  Therefore, gradient descent in that general sense is not the kind of thing that Risks from Learned Optimization is about.

Gradient descent in this general sense is a "two-argument function," , where  is the thing to be optimized and  is the objective function.  The objects of interest in Risks from Learned Optimization are curried single-arg... (read more)
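The distinction drawn here between a general two-argument procedure and a curried single-argument optimizer can be sketched in code. This is only an illustration of the distinction, not anything taken from Risks from Learned Optimization. The function names and the toy quadratic objective are invented for the example.

```python
from functools import partial


def gradient_descent(params, objective_grad, lr=0.1, steps=100):
    """General procedure: optimises whatever objective gradient it is handed."""
    for _ in range(steps):
        params = params - lr * objective_grad(params)
    return params


def grad_of_quadratic(x):
    """Gradient of the fixed toy objective (x - 3)**2."""
    return 2.0 * (x - 3.0)


# Currying in one specific objective yields a single-argument optimizer
# with a fixed objective baked in.
minimise_quadratic = partial(gradient_descent, objective_grad=grad_of_quadratic)

print(minimise_quadratic(0.0))  # converges towards 3.0
```

On this reading, `gradient_descent` by itself has no objective of its own, while `minimise_quadratic` is the kind of object that has one.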

2Gunnar_Zarncke
It is surely possible that there are mesa optimizers present in many, even relatively simple LLMs. But the question is: How powerful are these? How large is the state space that they can search through, for example? The state space of the mesa-optimizer can't be larger than the context window it is using to generate the answer, for example, while the state space of the full LLM is much bigger - basically all its weights.
1ProgramCrafter
Do current LLMs produce several options and then compare them according to an objective function? They do, actually, evaluate each of the possible output tokens and then emit one of the most probable ones, but I think that concern is more about an AI comparing larger chunks of text (for instance, evaluating paragraphs of a report by stakeholders' reaction).

Does the Solomonoff Prior Double-Count Simplicity?

Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?

Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string $x$. The prior probability for any such string is given by:

$$P(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}$$

where $p$ ranges over programs for a prefix-free Universal Turing Machine $U$.

Observation: If we have a simple one kilobit program $p_1$ that outputs prediction $x_1$, we can construct nearly $2^{1000}$ different two kilobit programs that also output $x_1$ by appending arbitrary "dead code" that never executes.

For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program p_1]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix-free requirement.

If we compare $p_1$ against another two kilobi... (read more)

5samshap
Yes, you are missing something. Any DEADCODE that can be added to a 1kb program can also be added to a 2kb program. The net effect is a wash, and you will end up with a $2^{1000}$ ratio over priors.
4Lucius Bushnaq
Why aren’t there $2^{1000}$ times fewer programs with such dead code and a total length below $10^{90}$ for $p_2$, compared to $p_1$?
1samshap
There are, but what does having a length below $10^{90}$ have to do with the Solomonoff prior? There's no upper bound on the length of programs.
3Dalcy
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead However:
6Lucius Bushnaq
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighting them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all. This resolves a remaining confusion I had about the connection between old school information theory and SLT. It apparently shows that a uniform prior over parameters (programs) of some fixed size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
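A toy count can illustrate the uniform-weighting claim. This is only a sketch under loud assumptions: the cutoff, program lengths, and wrapper overhead below are small made-up numbers (not the 1 kilobit and $10^{90}$ figures from the thread), and the model assumes every bit string of the right length is valid dead code.

```python
def count_padded_variants(base_len, max_len, overhead):
    """Count a program plus all its dead-code-padded variants of total length <= max_len."""
    count = 1  # the unpadded program itself
    max_dead = max_len - base_len - overhead
    for dead_bits in range(0, max_dead + 1):
        # Toy assumption: every bit string of this length is valid dead code.
        count += 2 ** dead_bits
    return count

MAX_LEN = 60            # hypothetical cutoff on total program length, in bits
OVERHEAD = 8            # hypothetical fixed cost of the dead-code wrapper, in bits
LEN_P1, LEN_P2 = 10, 20 # hypothetical lengths of a shorter and a longer program

n1 = count_padded_variants(LEN_P1, MAX_LEN, OVERHEAD)
n2 = count_padded_variants(LEN_P2, MAX_LEN, OVERHEAD)

# Under a uniform prior over all programs up to MAX_LEN, prior mass is
# proportional to these counts, so this ratio is the relative weight.
ratio = n1 // n2
```

In this toy model the shorter program ends up with $2^{\text{LEN\_P2} - \text{LEN\_P1}}$ times as many variants below the cutoff, which is the same ratio that explicit $2^{-|p|}$ weighting would assign.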
3harfe
I think you are broadly right. But note that under the Solomonoff prior, you will get another $2^{-2000-|G|}$ penalty for these programs with DEADCODE. So with this consideration, the weight changes from $2^{-1000}$ (for normal $p_1$) to $2^{-1000}(1+2^{-|G|})$ (normal $p_1$ plus $2^{1000}$ DEADCODE versions of $p_1$), which is not a huge change. For your case of "uniform probability until $10^{90}$" I think you are right about exponential decay.
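The arithmetic here can be checked exactly at a smaller scale. The following is a sketch with made-up toy sizes (12-bit program, 12 bits of dead code, 5 bits of wrapper overhead standing in for |G|); the names are hypothetical and only the structure of the calculation mirrors the comment.

```python
from fractions import Fraction

def solomonoff_weight(length_bits):
    """Prior weight 2^(-length) of a single program, as an exact fraction."""
    return Fraction(1, 2 ** length_bits)

LEN_P1 = 12     # toy stand-in for the 1000-bit program
DEAD_BITS = 12  # toy stand-in for the 1000 bits of dead code
OVERHEAD = 5    # toy stand-in for |G|, the wrapper overhead

base = solomonoff_weight(LEN_P1)
n_variants = 2 ** DEAD_BITS  # one padded program per dead-code string
padded = n_variants * solomonoff_weight(LEN_P1 + DEAD_BITS + OVERHEAD)

# Total weight of the original program plus all its padded variants.
total = base + padded
correction_factor = total / base  # equals 1 + 2^(-OVERHEAD)
```

With these toy numbers the padded variants together contribute only a factor of $1 + 2^{-5}$ on top of the original weight, matching the "not a huge change" conclusion.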
4Lucius Bushnaq
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance. Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
3Richard_Kennaway
The number of programs of length at most n increases exponentially with n. Therefore any probability measure over them must decrease at least exponentially with length. That is, exponential decay is the least possible penalisation of length. This is also true of the number of minimal programs of length at most n, hence the corresponding conclusion. (Proof: for each string S, consider the minimal program that writes S and halts. These programs are all different. Their sizes are no more than length(S)+c, where c is the fixed overhead of writing a program with S baked into it. Therefore exponentiality.) I've written "at most n" instead of simply "n", to guard against quirks like a programming language in which all programs are syntactically required to e.g. have even length, or deep theorems about the possible lengths of minimal programs.

Has anyone thought about how the idea of natural latents may be used to help formalise QACI?

The simple core insight of QACI according to me is something like: A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals. Even if this formal process costs galactic amounts of compute and can never actually be run, not even by the AGI itself. 

This allows for some funny value specification strategies we might not usually think about. For example, we could try using some camera recordings of the present day, a for loop, and a code snippet implementing something like Solomonoff induction to formally specify the idea of Earth sitting around in a time loop until it has worked out its CEV.

It doesn't matter that the AGI can't compute that. So long as it can reason about what the result of the computation would be without running it, this suffices as a pointer to our CEV. Even if the AGI doesn't manage to infer the exact result of the process, that's fine so long as it can infer some bits of information about the result. This just ends up giving the AGI some moral uncerta... (read more)

I do think natural latents could have a significant role to play somehow in QACI-like setups, but it doesn't seem like they let you avoid formalizing, at least in the way you're talking about. It seems more interesting in terms of avoiding specifying a universal prior over possible worlds, if we can instead specify a somewhat less universal prior that bakes in assumptions about our world's known causal structure. It might help with getting a robust pointer to the start of the time snippet. I don't see how it helps avoid specifying "looping", or "time snippet", etc. Natural latents seem to me to be primarily about the causal structure of our universe, and it's unclear what they even mean otherwise. It seems like our ability to talk about this concept is made up of a bunch of natural latents, and some of them are kind of messy and underspecified by the phrase, mainly relating to what the heck is a physics.

6Lucius Bushnaq
That's mainly what I meant, yes. Specifying what the heck a physics is seems much more tractable to me. We don't have a neat theory of quantum gravity, but a lattice simulation of quantum field theory in curved space-time, or just a computer game world populated by characters controlled by neural networks, seems pretty straightforward to formally specify. We could probably start coding that up right now. What we lack is a pointer to the right initial conditions for the simulation. The wave function of Earth in case of the lattice QFT setup, or the human uploads as neural network parameters in case of the game environment.

To me kinda the whole point of QACI is that it tries to actually be fully formalized. Informal definitions seem very much not robust to when superintelligences think about them; fully formalized definitions are the only thing I know of that keep meaning the same thing regardless of what kind of AI looks at it or with what kind of ontology.

I don't really get the whole natural latents ontology at all, and mostly expect it to be too weak for us to be able to get reflectively stable goal-content integrity even as the AI becomes vastly superintelligent. If definitions are informal, that feels to me like degrees of freedom in which an ASI can just pick whichever values make its job easiest.

Perhaps something like this allows us to use current, non-vastly-superintelligent AIs to help design a formalized version of QACI or ESP which itself is robust enough to be passed to superintelligent optimizers; but my response to this is usually "have you tried first formalizing CEV/QACI/ESP by hand?" because it feels like we've barely tried and like reasonable progress can be made on it that way.

Perhaps there are some cleverer schemes where the superintelligent optimizer is pointed at the weaker cur... (read more)

8Lucius Bushnaq
The idea would be that an informal definition of a concept conditioned on that informal definition being a pointer to a natural concept, is ≈ a formal specification of that concept. Where the ≈ is close enough to a = that it'd hold up to basically arbitrary optimization power.
4Tamsin Leake
So the formalized concept is Get_Simplest_Concept_Which_Can_Be_Informally_Described_As("QACI is an outer alignment scheme consisting of…") ? Is an informal definition written in English? It seems like "natural latent" here just means "simple (in some simplicity prior)". If I read the first line of your post as: It sure sounds like I should read the two posts you linked (perhaps especially this one), despite how hard I keep bouncing off of the natural latents idea. I'll give that a try.
6Lucius Bushnaq
More like the formalised concept is the thing you get if you poke through the AGI’s internals searching for its representation of the concept combination pointed to by an English sentence plus simulation code, and then point its values at that concept combination.
4Tamsin Leake
Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea. I guess one of the main things I'm worried about is that it seems to require that we either:

* Be really good at timing when we pause it to look at its internals, such that we look at the internals after it's had long enough to think about things that there are indeed such representations, but not long enough that it started optimizing really hard such that we either {die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. If such a time interval even occurs for any amount of time at all.
* Have an AI that is powerful enough to have powerful internals-about-QACI to look at, but corrigible enough that this power is not being used to do instrumentally convergent stuff like eat the world in order to have more resources with which to reason.

Current AIs are not representative of what dealing with powerful optimizers is like; when we'll start getting powerful optimizers, they won't sit around long enough for us to look at them and ponder, they'll just quickly eat us.
3Jonas Hallgren
In natural language maybe it would be something like "given these ontological boundaries, give us the best estimate you can of CEV."? It seems kind of related to boundaries as well: if you think of natural latents as "functional Markov blankets" that cut reality at its joints, then you could probably say that you want to preserve the part of that structure that is "human agency" or similar. I don't know if that makes sense, but I like the idea direction!
1Daniel C
I think the fact that natural latents are much lower dimensional than all of physics makes them suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class). IMO the main bottleneck is to account for the reflective aspects in CEV, because one constraint of natural latents is that they should be redundantly represented in the environment.
2Lucius Bushnaq
It is redundantly represented in the environment, because humans are part of the environment. If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in. (The time loop part and the condition for terminating the loop can be formally specified in code, so the AI doesn't need to think those are natural concepts) If the AI didn't have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.  
1Daniel C
Natural latents are about whether the AI's cognition routes through the same concepts that humans use. We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn't make use of the concept "strawberry" (in principle, we can still "single out" the concept of "strawberry" within it, but that information comes mostly from us, not from the physics simulation). Natural latents are equivalent up to isomorphism (i.e. two latent variables are equivalent iff they give the same conditional probabilities on observables), but for reflective aspects of human cognition, it's unclear whether that equivalence class pins down all the information we care about for CEV (there may be differences within the equivalence class that we care about), in a way that generalizes far out of distribution.
3Lucius Bushnaq
My claim is that the natural latents the AI needs to share for this setup are not about the details of what a 'CEV' is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.
1Daniel C
Noted, that does seem a lot more tractable than using natural latents to pin down details of CEV by itself.

Two shovel-ready theory projects in interpretability.

Most scientific work isn't "shovel-ready." It's difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.

Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results. 

Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop duri... (read more)