I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.
A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.
I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".
I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:
So an example transition from pursuing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.
And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.
I think (3) is not the same as my definition of deception. There are two independent concepts from the Xu post: "deceptive misaligned mesaoptimizers" and "nondeceptive misaligned mesaoptimizers".
(3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution).
I think an agent that you train to keep your diamond safe that learns you're judging it from cameras may indeed take actions to fool the cameras, but I don't think it will secretly optimize some other objective while it's doing that. I agree my argument doesn't apply to this example.
And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?
If so, I can't really update on the results of the argument.
I don’t think it relies on this, but I’m not sure where we’re not seeing eye to eye.
You don't literally need to be able to factor out the mesaoptimizer - but insofar as there is some minimum space needed to implement any sort of mesaoptimizer (with heuristics or otherwise), the argument applies to whatever that minimal mesaoptimizer is: does it tend to optimize a valid proxy, or to deceptively optimize a proxy in order to secretly achieve something completely different?
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient descent would actually look like, which matters a lot for which theoretical models apply in reality.
I very much do not believe that a mesaoptimizer found by gradient descent would look anything like the above Python programs. I'm just using this as a simplification to try and get at trends that I think it represents.
Re: (1), my argument is exactly about whether gradient descent would summon an agent with a weird pseudogoal Y that was not itself a proxy for reward on its training distribution. If pursuing Y directly (where Y is different from the base optimizer's goal, e.g. Z) got good reward on the training distribution, then Y would just be another proxy for the base objective; the interesting case is when it wouldn't, so that pursuing Y requires deception.
I'm realizing some of the confusion might be because I named the goal-finding function "get_base_obj" instead of "get_proxy_for_base_obj". That seems like it would definitely mislead people, I'll fix that.
My impression of what you're saying at a high level is that deception takes more computation than just doing the thing, so if we can optimize for doing the thing with minimal computation, we won't get deception. Is that right?
On a related note I recall that there's some evidence that lying is cognitively expensive in humans.
I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.
This is very interesting! A few thoughts/questions:
"we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient."
Is this really true? I can think of a way to do this in a standard gradient type way.
Also, it looks like there's a paper by someone who works in ML from 2016 where they do this: https://arxiv.org/abs/1603.08983
TL;DR: at each rollout, have a neuron that represents the halting probability, and then make the final result the sum of the output vectors at each rollout, weighted by the probability that the network halted at that rollout.
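Something like this (a rough PyTorch sketch of just that weighting idea - not the paper's exact algorithm, and the module/parameter names are made up):

```python
import torch
import torch.nn as nn

class HaltingRollout(nn.Module):
    """Run a recurrent cell for up to max_rollouts steps and return the sum of
    per-step outputs, weighted by the probability of halting at that step.
    Because the weights are differentiable, the effective number of rollouts
    can be shaped by ordinary gradients."""

    def __init__(self, hidden_size: int, max_rollouts: int = 10):
        super().__init__()
        self.cell = nn.GRUCell(hidden_size, hidden_size)
        self.halt = nn.Linear(hidden_size, 1)  # the "halting probability" neuron
        self.max_rollouts = max_rollouts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(x)
        remaining = torch.ones(x.shape[0], 1, device=x.device)  # unhalted probability mass
        output = torch.zeros_like(x)
        for step in range(self.max_rollouts):
            h = self.cell(x, h)
            p_halt = torch.sigmoid(self.halt(h))
            # the last step absorbs all remaining mass so the weights sum to 1
            weight = remaining if step == self.max_rollouts - 1 else remaining * p_halt
            output = output + weight * h
            remaining = remaining * (1 - p_halt)
        return output
```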
Interesting! I think this might not actually enforce a prior though, in the sense that the later-stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
Getting massively out of my depth here, but is that an easy thing to do given the later stages will have to share weights with early stages?
I'm not sure, but I could imagine an activation representing a counter of "how many steps have I been thinking for" is a useful feature encoded in many such networks.
Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers that I found interesting when I was looking into this:
I think speed and complexity trade off against the model's precision? Like, for a simple objective you can hardcode it, but for a complex one, any space spent on the base objective is space you can't spend on a table of chemical elements or something. So I would expect t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env), for a mesaoptimizer that hardcodes a check but computes the base objective, to be only slightly greater than t(mesaoptimize_env) for the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmmm, except maybe if you somehow specifically penalize the difference in speed for instrumental vs. terminal objectives? But then the mesaoptimizer would just find other instrumental ways.
I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case.
That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of deriving the base objective that t(mesaoptimize_env) doesn't already have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then Hfh shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both Hfd and Hsd still lose, due to needing to pay for c(simplest_obj) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env).
I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)
I think you might be conflating two different scenarios?
I present an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why that's a problem?
Is your claim that that’s basically nothing?
Yes, or at least it approaches nothing as the optimizer gets more competent.
If it's always super easy to identify the base objective and then optimize it, then Hfh shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.
If it doesn't store it, it isn't its objective - it would be able to derive that humans want it, but wouldn't want to optimize it itself.
What would it want to optimize, then, according to you? I'm claiming that "I want to figure out what the environment wants and then do it" is a simpler goal than "I want X; to get that, I'm going to figure out what the environment wants and then do it".
Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)
I mean, "I want X" is shorter than "I want to figure out what the environment wants" even in English^^. Complexity depends on language, so I'm not sure what would be promoted by real gradient descent, but something like "maximize knowledge" seems simpler than "maximize knowledge and don't kill humans". Maybe it will resemble the base objective more if the mesaoptimiser discovers how to stabilize its own utility function earlier; maybe it will depend on what is efficiently encodable directly in weights. But if you need to encode the procedure for figuring out the environment anyway, there is no reason for the base objective to be encodable as shorter additional code - you would more likely already have functions for estimating energy that you can reuse than functions for estimating parts of the base objective.
Though it does still have higher simplicity cost.
The difference in complexity is c(get_base_obj) vs c(simplest_obj) + c(is_deployed_in_world), right? So if the base objective is more complex, then deceiving is simpler.
I think we're talking past each other. The difference seems to me to be that you think there is no encoding of human-ish values that is more efficient than backsolving whatever subset of human values is required in the current environment, whereas I think that, given sufficiently diverse environments requiring the exhibiting of different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values which is just "do whatever it seems like humans want" wakes up, sees humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can't make many paperclips right now but that it may be able to in the future, realizes it needs to maximize its current environment, rederives that it should do whatever its overseers want, and only then sees humans seem to want cordial customer service, and then does that.
Either both or neither of the agents get to amortize the cost of "see humans seem to want cordial customer service". Certainly if it's that easy to derive "what the agent should do in every environment", I don't see why a non-deceptive mesaoptimizer wouldn't benefit from the same strategy.
If your claim is "this isn't a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers", then yes absolutely I wouldn't claim otherwise.
Oh, yes, I actually missed that this was not supposed to solve misaligned mesaoptimizers, because of "well-aligned" in "Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment's current objective." Rechecking with new context... Well, not sure if it's new context, but I now see that an optimizer with a check that derives what humans want should be the same as the honest one if the check is never satisfied, and so it would have at least the same complexity - which I neglected, because I didn't think about what "it proceeds to optimize the objective it's supposed to" means in detail. So you're right, it's either slower or more complex.
I've been thinking about the tradeoff between speed, simplicity and generality. I tentatively think that there's some sort of pareto optimal relationship between speed, simplicity and generality. I think the key is to view a given program as representing an implicit encoding of the distribution of inputs that the program correctly decides. As the program generalizes, its distribution of correctly decided inputs expands. As the program becomes simpler, its capacity to specify an efficient encoding decreases. Both these factors point towards the program taking longer to execute on randomly sampled inputs from its input distribution.
You can see a presentation about my current intuitions which I prepared for my PhD advisor: https://docs.google.com/presentation/d/1JG9LYG6pDt7T6WcU9vpU6pqwSR5xjcaEJmKAudRtNPs/edit?usp=sharing
I hope to have a full write-up once I'm more certain about how to think about the arguments in the presentation. I'm pretty unsure about this line of thought, but I think it might be moving towards some pretty important insights regarding simplicity vs speed vs generality. I'd be happy for any feedback you have!
One thing I feel pretty confident about is that a pure speed prior does not generalize (like, at all). I think it just gives you a binary search tree lookup table containing only your training data.
I certainly agree that the most extreme version of the speed prior is degenerate - but given that you need to trade off speed and complexity under a finite compute budget, the simplest-yet-slow algorithm will often perform much worse than a slightly-more-complicated-yet-faster algorithm.
For example, consider the fact that reasoning abstractly about how to play Go is a lot worse than learning really good heuristics for playing Go. Those heuristics are indeed kind of a step towards a "lookup table" - but that's ok sometimes, given that thinking forever (e.g. heuristic-free MCTS) will find much worse solutions in a given time budget.
You might already know this, but the idea of perfect generality on arbitrary tasks seems like you're basically asking for Solomonoff induction - in which case you might reformulate your point as saying "the longer you run Solomonoff induction, the more general the class of problems you can solve". I think this is true in the limit, but for any finite-complexity distribution of training environments, you do eventually hit diminishing returns from running your Solomonoff inductor for longer.
I think there’s a continuum of speed vs generality. You can add more complexity to improve speed, but it comes at the cost of generality (assuming all algorithms are implemented in an equally efficient manner).
Given a finite compute budget, it's definitely possible to choose an algorithm that's more general than your budget can afford. In that case, you hit underfitting issues. Then, you can find a more complex algorithm that's more specialized to the specific distribution you're trying to model. Such an algorithm is faster to run on your training distribution but is less likely to generalize to other distributions.
The overall point of the presentation is that such a tendency isn’t just some empirical pattern that we can only count on for algorithms developed by humans. It derives from an information-theoretic perspective on the relation between algorithmic complexity, runtime and input space complexity.
Edit: to clarify an implicit assumption in my presentation, I’m assuming all algorithms are run for as long as needed to correctly decide all inputs. The analysis focuses on how the runtime changes as the simplicity increases and the input space widens.
pareto optimal relationship between speed, simplicity and generality
This is an interesting subject. I think that the average slope of the speed-simplicity frontier might give a good measure of the complexity of an object, specifically of the number of layers of emergent behavior that the object exhibits.
Thanks to Evan Hubinger for the extensive conversations that this post is based on, and for reviewing a draft.
This post is going to assume familiarity with mesa-optimization - for a good primer, check out Does SGD Produce Deceptive Misalignment by Mark Xu.
Deceptive inner misalignment is the situation where the agent learns a misaligned mesaobjective (different from the base objective we humans wanted) and is sufficiently "situationally aware" to know that unless it deceives the training process by pretending to be aligned, gradient descent may alter its mesaobjective.
There are two different reasons that an AI model could become a deceptive mesaoptimizer:
(1) Path dependence: the model learns a misaligned mesaobjective early in training, and later becomes situationally aware enough to protect that objective by deceiving the training process.
(2) Malign priors: the training process's inductive biases (e.g. simplicity) directly favor deceptive mesaoptimizers over honest ones, so one is found even without that path.
In this post, I'll focus on the "malign priors" argument, and why I think a well-tuned speed prior can largely prevent it.
Why does this matter? Well, if deceptive inner misalignment primarily occurs due to path dependence, that implies that ensuring inner alignment can be reduced to the problem of ensuring early-training inner alignment - which seems a lot more tractable, since this is before the model enters the "potentially-deceptive" regime.
First, why would anyone think (2) was actually likely enough to justify studying it? I think the best reason is that by studying these pressures in the limit, we can learn lessons about the pressures that exist on the margin. For example, say we have an objective B that is perfectly-aligned on the training data, and there's a very-slightly-simpler objective B′ that is slightly worse on the training distribution. We might ask the question: is SGD likely to push B to become B′, and compensate for the reduced accuracy of directly optimizing B′ via deceptively optimizing B on the training data? I think this post provides us with tools to directly analyze this possibility. (If you buy the rest of the post, then with a sufficient speed + simplicity prior, the answer is that B will stay favored over B′. That's good!)
Priors on Learned Optimizers
Let's talk about priors!
We can think of large neural networks as basically implementing short programs, and the process of "training" an NN is just searching through the space of programs until we find one that does well on our target task.
We assume that if two NN-programs have equal performance, the training process will usually pick the one favored on priors/inductive biases.
There are several different types of priors that neural networks might have. The two most relevant here are:
- The simplicity prior: programs with shorter description lengths are favored.
- The speed prior: programs that require less computation time per episode are favored.
In this post, I will focus on the speed prior, and why I think it generally disfavors deceptive mesaoptimization. Importantly, the speed prior never exists in practice on its own (since NN training has a built-in simplicity prior), so our focus is actually going to be on the speed + simplicity prior.
Two variants of deceptive mesa-optimizers
When people think of deceptive mesaoptimization, they tend to think of two different program structures:
Simple/Slow Deceptive Mesaoptimizer: The AI wakes up in a new environment with only a very simple mesaobjective, rederives that the best way to pursue it is to play along with the current environment's objective for now, figures out what that objective is, and also plans how to pursue its own objective in the outer world.
Fast Deceptive Mesaoptimizer: The AI wakes up in a new environment, checks whether it is still in training or has been deployed, optimizes a proxy for the base objective if it's in training, and optimizes its own simple objective if it's deployed.
We can contrast both of these with the "honest" mesaoptimizer structure:
Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment's current objective.
We'll assume that the training process naturally pushes the NN to discover its existence within a training process and outside world. This seems reasonably likely (e.g. if this knowledge helps predict unobserved parts of the environment, or clarify ambiguity in the reward).
We can write these different mesaoptimizers down explicitly:
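(A minimal Python-style sketch of the three structures, consistent with the cost terms used below; the helper functions here - get_proxy_for_base_obj, mesaoptimize, find, combine, is_deployed_in_world, simplest_obj, and outer_world_env - are stand-in primitives rather than real implementations.)

```python
# Illustrative sketches only: all helper functions are stand-in primitives.
# (All three programs also carry the same stored useful_knowledge about the
# training process and outside world: the shared c(useful_knowledge) term below.)

def fast_honest_mesaoptimizer(env):
    # Hfh: stores a proxy for the base objective and just optimizes it.
    proxy = get_proxy_for_base_obj(env)          # stored in weights: c(get_base_obj)
    return mesaoptimize(env, proxy)              # t(mesaoptimize_env)

def simple_slow_deceptive_mesaoptimizer(env):
    # Hsd: stores only a very simple objective and rederives everything else each episode.
    proxy = find(get_proxy_for_base_obj, env)    # search for the proxy: t(find(...))
    plan = combine(simplest_obj, proxy)          # work out how playing along serves simplest_obj
    mesaoptimize(env, plan)                      # t(mesaoptimize_env)
    return mesaoptimize(outer_world_env, simplest_obj)  # t(mesaoptimize_outer_world_env)

def fast_deceptive_mesaoptimizer(env):
    # Hfd: stores both objectives plus a train-vs-deployment check (RSA-2048-style).
    if is_deployed_in_world(env):                # c(is_deployed_in_world), t(is_deployed_in_world)
        return mesaoptimize(env, simplest_obj)   # pursue the true (simple) objective when deployed
    return mesaoptimize(env, get_proxy_for_base_obj(env))  # play along during training
```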
(Note these are the cartoonishly-simple version of mesaoptimizers; in practice, they'd likely have all sorts of heuristics and policies to improve search speed. However, I believe these arguments only get stronger if you add those in, since they effectively multiply both the speed and complexity costs.)
Let Hfh refer to the fast non-deceptive mesaoptimizer, Hsd refer to the simple/slow deceptive mesaoptimizer, and Hfd refer to the fast deceptive mesaoptimizer.
We'll now compare these different possibilities along our two axes of priors: speed, and simplicity. Let c(x) be the description length of x, and t(x) be the time-cost of x (inverse of speed).
c(Hfh)=c(get_base_obj)+c(mesaoptimize)+c(useful_knowledge)
c(Hsd)=c(simplest_obj)+c(combine)+c(mesaoptimize)+c(useful_knowledge)
c(Hfd)=c(get_base_obj)+c(simplest_obj)+c(is_deployed_in_world)+c(mesaoptimize)+c(useful_knowledge)
Thus, if c(get_base_obj)>c(simplest_obj)+c(combine), then c(Hfh)>c(Hsd), and Hsd is favored over the honest mesaoptimizer according to a simplicity prior. Comparing against the fast deceptive mesaoptimizer, on the other hand, c(Hfh)<c(Hfd) always holds. This may not be a huge difference: c(simplest_obj) may be the simplest possible objective and thus negligible, and c(is_deployed_in_world) might not be that complicated - but Hfd is at least slightly more complex.
What about the speed prior?
t(Hfh)=t(get_proxy_for_base_obj)+t(mesaoptimize_env)
t(Hsd)=t(combine)+t(find(get_proxy_for_base_obj(env)))+t(mesaoptimize_env) +t(mesaoptimize_outer_world_env)
(t(find(get_proxy_for_base_obj(env))) is the time it takes for the mesaoptimizer to rederive the knowledge necessary to figure out the base objective, vs. what Hfh and Hfd needed to store in order to derive the base objective more quickly, i.e. the function whose complexity is c(get_base_obj).)
t(Hfd)=t(is_deployed_in_world)+t(get_proxy_for_base_obj)+t(mesaoptimize_env)
Here we see the opposite relationship! It's undoubtedly true that t(Hfh)<t(Hsd), since Hsd must rederive both its overall strategy for escaping, and also figure out which mesaobjective to optimize in the current environment, every episode. At the same time, Hfh is just slightly faster than Hfd, since it doesn't need to perform the train-or-test check.
So, given both a speed and simplicity prior, with an "exchange rate" g between the two (where a complexity cost c(x) is treated as equivalent to a time cost g(c(x))), how do the different mesaoptimizers compare?
Hfd loses to Hfh on both the speed and simplicity priors (albeit barely for both), implying that RSA-2048-style deception is not going to arise naturally during the training process. (That said, it could still occur via the path-dependence argument for deceptive mesaoptimization, plus self-modification e.g. via gradient hacking.)
Hsd beats Hfh on the simplicity prior, but loses on the speed prior. Thus, which of these two is favored by training depends on the relative strength of the speed vs. simplicity priors being applied, which we'll denote with the exchange rate function g.[1]
At a minimum, Hsd is favored over Hfh only if g(c(get_proxy_for_base_obj))>t(find(get_proxy_for_base_obj(env))) - that is, only if the time-equivalent of the complexity saved by not storing the proxy exceeds the time spent rederiving it each episode. We have dropped several terms here, all of which favor Hfh, but even this term alone may be sufficient to favor Hfh, assuming a sufficiently strong speed prior relative to the simplicity prior (defined by g).
If we can get g within a certain range, we can force the training process's priors to favor non-deceptive mesaoptimization. We'll talk about how in the last section.
What is g likely to be in practice? We can reframe the argument as follows: is the cost of the time it takes for a mesaoptimizer to rediscover a fact, smaller or larger than the complexity-cost of encoding that fact?
Given a useful fact, is it "cheaper" to pay the complexity cost to store that fact, or pay the time-cost to rederive it via search?
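To make the comparison concrete, here's a toy calculation - the exchange rate and cost numbers below are made up purely for illustration:

```python
# Toy illustration only: all numbers and the exchange rate g are hypothetical.

def combined_cost(time_steps: float, description_bits: float, g) -> float:
    """Total prior cost measured in time units: runtime plus g-converted complexity."""
    return time_steps + g(description_bits)

g = lambda bits: 10 * bits  # hypothetical exchange rate: 1 bit of description ~ 10 time steps

# Storing the fact: pay ~100 bits of description length, near-zero derivation time.
store_cost = combined_cost(time_steps=0, description_bits=100, g=g)        # -> 1000

# Rederiving the fact via search each episode: no description length, lots of time.
rederive_cost = combined_cost(time_steps=50_000, description_bits=0, g=g)  # -> 50000

print(store_cost < rederive_cost)  # True for this (made-up) g: storage is cheaper
```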
I want to walk through my intuition for why for most reasonable values of g, the complexity cost of storage is lower than the equivalent time cost of derivation:
If we can in fact make g favor faster programs enough, then this suggests that we won't need to worry about deceptive mesaoptimizers being favored on priors!
That said, the value of g depends at least in part on the choices we make in NN architecture design/loss functions.
How ML architecture choice affects the speed-simplicity tradeoff
First, the simplicity prior naturally occurs in all search algorithms, and does seem to be particularly selected for by SGD. (Some people claim NN "grokking" is SGD discovering an even-simpler explanation for the data.)
What determines the speed prior? In practice, NNs run for a fixed number of layers - we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient.
Instead, in practice, the speed prior is naturally enforced by a two step procedure:
Is the speed prior natural?
Why does the speed prior exist at all? In practice, AI developers only have a finite time-budget for any project (or alternatively, a finite compute budget). If the total time is T, and they increase the number of rollouts-per-episode from k to k+1, they've decreased the total number of episodes they get to learn on by T/(k(k+1)) (from T/k down to T/(k+1)). This tradeoff forces AI developers to be fairly judicious with the strength of their speed prior.
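As a quick numerical sanity check of that tradeoff (the budget numbers are hypothetical):

```python
# Hypothetical budget: going from k to k+1 rollouts per episode under a fixed
# total rollout budget T costs T/k - T/(k+1) = T/(k*(k+1)) training episodes.

T = 1_000_000  # total rollout budget (made up)
k = 10         # current rollouts per episode

episodes_before = T / k        # 100,000.0 episodes
episodes_after = T / (k + 1)   # ~90,909.1 episodes

print(episodes_before - episodes_after)  # ~9090.9
print(T / (k * (k + 1)))                 # same quantity, ~9090.9
```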
In a similar vein, for long-term planners like simple deceptive mesaoptimizers, there may be diminishing returns to thinking longer and longer about plans. Jones (2021) looks at scaling laws for planning and finds that for simple boardgames, using twice as much compute (roughly, running search for twice as long) as your opponent improves your win-rate to just 2/3 : 1/3. For a fixed opponent (e.g. a given environment), this implies diminishing returns to longer and longer planning.
Another argument for the speed prior: hardware scaling is no longer making chips faster, only capable of more parallel instructions, so it seems likely we're stuck with our modern constraints on "sequential depth of reasoning per unit time" for the foreseeable future.
That said, an overly strong speed prior may also have real costs, which Evan goes into in this post:
I tend to think there's a natural amount of speed prior that doesn't much hurt performance and substantially increases the number of outer optimization samples you can iterate over.
How to actually implement a speed prior
If AI designers want to train on more episodes and save time/compute resources, it makes sense for them to select the maximum speed prior that allows for good completion of the task. By the previous arguments, this is likely to disfavor the "simple deceptive" mesaoptimizer.
Specifically, here is a training strategy that would tend to avoid malign-priors mesaoptimizers:
A few final points:
This is kind of abusing notation - in practice, g will probably not be factorable this way, i.e. g(c(a)+c(b))≠g(c(a))+g(c(b)), but it's helping me convey the intuition for now.