Epistemic status: After a couple hours of arguing with myself, this still feels potentially important, but my thoughts are pretty raw here.

Hello LessWrong! I’m an undergraduate student studying at the University of Wisconsin-Madison, and part of the new Wisconsin AI Safety Initiative. This will be my first “idea” post here, though I’ve lurked on the forum on and off for close to half a decade by now. I’d ask you to be gentle, but I think I’d rather know how I’m wrong! I’d also like to thank my friend Ben Hayum for going over my first draft and WAISI more broadly for creating a space where I’m finally pursuing these ideas in a more serious capacity. Of course, I’m not speaking for anyone but myself here.

With that said, let’s begin!

The Claim

I think there are--among others--two pretty strong, concerning, and believable claims often made about the risks inherent in AGI.

First, an optimizer searching for models that do well at some goal/function/etc. might come across a model that itself contains an internal optimization process aimed at some other, distinct goal. This is, of course, entirely plausible: humans exist, we can fairly represent them as optimizers, they were produced by natural selection (which can fairly be represented as an optimizer), and human “goals” do not perfectly align with those of natural selection. These inner optimizers are mesa-optimizers.

Second, the idea that the values/goals/utility function of sufficiently strong optimizers (e.g., a superintelligent AGI) will tend to cohere. In other words, stronger optimizers will tend towards more strongly optimizing for a single, non-contradictory utility function (and they will do this via expected utility maximization or something like it, though I don't think this post depends on that). This is coherence.
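(To make the coherence intuition a bit more concrete, here is a toy money-pump sketch of my own; the items, fee, and numbers are invented purely for illustration and nothing later in the post depends on them. The standard intuition is that an agent with cyclic preferences can be traded in circles and steadily drained of resources, while an agent ranking outcomes by a single utility function cannot be exploited this way.)

```python
# Toy money-pump sketch: an agent with cyclic preferences A > B > C > A
# will pay a small fee for every "upgrade" and can be traded in a circle
# forever, bleeding resources. This only illustrates the usual coherence
# intuition; it is not a formal coherence theorem.

# Cyclic preference relation: maps the item currently held to the item
# the agent strictly prefers and will pay to swap into.
cyclic_prefs = {"A": "B", "B": "C", "C": "A"}

def run_money_pump(prefs, start_item, wealth, fee, rounds):
    """Repeatedly offer the agent the item it prefers to its current one,
    charging `fee` per swap. Returns the agent's wealth after `rounds` offers."""
    item = start_item
    for _ in range(rounds):
        # The agent strictly prefers prefs[item], so (by assumption)
        # it accepts the trade and pays the fee.
        item = prefs[item]
        wealth -= fee
    return wealth

print(run_money_pump(cyclic_prefs, "A", wealth=100.0, fee=1.0, rounds=30))
# -> 70.0: after 30 trades the agent is holding item "A" again, but is
#    30 units poorer. An agent that ranks items by one consistent utility
#    function never accepts such a cycle of trades.
```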

Maybe putting these two claims right next to each other makes what I'm about to suggest very obvious, and maybe it has already been discussed. However, I'm making this post because it wasn't obvious to me until I thought it up, and these two discussions don't seem to come up alongside one another very often. While there have been a number of posts for and against the idea of coherence theorems, I haven't seen anything making this observation explicitly.

To make the claim clear: if sufficiently strong optimizers tend towards optimizing for a coherent utility function, that implies they will tend towards optimization processes that do not spawn adversarial mesa-optimizers as part of their methodology. At least, they can't be letting those mesa-optimizers win; otherwise, those mesa-optimizers would use resources from the top-level process to optimize for something else, which would contradict the premise of coherent utility seeking.

Initial Thoughts

Since this is a very fresh idea to my mind, I've written some initial thoughts about how someone might argue against this claim.

My first concern is that I'm using these abstractions poorly, but perhaps that provides an opportunity to clear up some confusion in the discussion of coherence and mesa-optimization. For instance: we could keep the initial two claims independent of each other by claiming that any mesa-optimizer spawned by an optimization process is no longer a "part" of that process. After all, coherence claims are usually formulated in terms of agents with more dominant strategies outcompeting other agents, so throwing in extra claims about internal dynamics may be disruptive or out of scope. Even so, I think there is still information to be gained here.

Specifically, consider two possibilities: 

  • The original optimizer "loses" or gets fooled by the new mesa-optimizer, eventually causing the overall system to go haywire and optimize for the mesa-optimizer's "goal" over the original "goal". Equivalently, the future lightcone ends up carrying more of the mesa-optimizer's utility than the original utility, or however you want to put it.
  • The mesa-optimizer(s) spawn but never get the opportunity to win over the top-level optimizer. This could be further split into situations where the two reach an equilibrium, or where the mesa-optimizer eventually disappears, but the point is that the top-level optimization function still gets fulfilled to a significant degree “as intended”.

The problem: In scenario 1, the mesa-optimizer is still, by definition, an optimizer, so any general claims about optimization apply to it as well. That means if any optimization process can spawn adversarial mesa-optimizers, so can this one, and then we can return to the two scenarios above.

Either we eventually reach a type of optimizer for which scenario 2 holds indefinitely and no new mesa-optimizer ever wins, or there is always a chance for scenario 1 to occur, and long-term supercoherence is inherently unstable. Adversarial mesa-optimizers could spawn within the system at any moment (or perhaps at a set of important moments), producing an eternally shifting battle over what to optimize for, even though every individual optimizer is “coherent” or close to it.

Another point of contention could be whether mesa-optimizers occur within the same types of processes as the agents referenced by coherence theorems. One way this could be true is if coherent agents perform nothing that could be classified as internal optimization, but I find that difficult to accept. And even if it were the case, surely such entities could be outcompeted by processes that do have internal optimization in the real, complex world?

The final concern I have is "does this matter?" I'm still uncertain about this, but since I can't rule out that it matters, I think the idea is worth sharing. One possibility is that any optimization process strong enough to resist mesa-optimizers is too complex for humans to design or realize before it is too late, and none of the optimization methods available to us will have this property (and be usable).

The more common issue I keep coming back to is that this claim may be redundant, in the sense that the optimization processes that win against mesa-optimizers may necessarily depend on solving alignment, thus allowing them to create new trustworthy optimizers at will. In other words, this could just be a rephrasing of the argument that AIs will also have to solve alignment (though certainly one I’ve never seen!). For example, if mesa-optimization is primarily driven by the top-level optimizer being unable to score its models such that the best-appearing model is also the one that is best for it across all future scenarios, then solving the problem seems to require knowing with certainty that a particular model will generalize to your metric perfectly off the training distribution–and that just sounds like alignment. By the same token, though, this may also imply that coherent optimization by an agent isn’t possible without its internals having something to say about the alignment problem. I also started thinking about this a few days after reading this post on how difficult gradient hacking seems. Perhaps such a process is not so far away?
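As a heavily simplified sketch of that scoring problem (my own toy example; the policy names, objectives, and distributions below are invented for illustration and are not meant to capture real mesa-optimization), consider an outer search that can only compare candidates by their score on a training distribution. A candidate whose behavior tracks a proxy that happens to coincide with the base objective on that distribution looks exactly as good as one that tracks the base objective directly, so the training score alone cannot tell the outer search which candidate will keep serving it off-distribution:

```python
import random

random.seed(0)

# Base objective the outer search "wants": choose an action that cancels x.
def base_objective(x, action):
    return -abs(x + action)

# Two candidate policies the outer search might select between.
def aligned_policy(x):
    return -x            # directly cancels x: optimal for the base objective

def proxy_policy(x):
    return -min(x, 1.0)  # tracks a proxy that matches -x only when x <= 1

# Training distribution: x is always small, so the proxy coincides with
# the base objective and both candidates behave identically.
train_xs = [random.uniform(0.0, 1.0) for _ in range(1000)]
# Deployment distribution: x can be large, where the proxy diverges.
deploy_xs = [random.uniform(0.0, 10.0) for _ in range(1000)]

def score(policy, xs):
    """Average base-objective score of a policy over a set of inputs."""
    return sum(base_objective(x, policy(x)) for x in xs) / len(xs)

print("train  :", score(aligned_policy, train_xs), score(proxy_policy, train_xs))
print("deploy :", score(aligned_policy, deploy_xs), score(proxy_policy, deploy_xs))
# On the training set the two candidates score identically (both 0.0), so a
# score-based outer search cannot tell them apart; off-distribution the proxy
# candidate does badly. Knowing which candidate generalizes requires
# information the training score alone does not provide.
```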

Naturally, even if true, this doesn't immediately solve everything (or much of anything). Still, I think we can agree it would be pretty great to find a process we could trust not to continuously generate adversarial optimizers. If this line of reasoning is correct, one of its outcomes reads like an existence proof of such a process–though it might depend on alignment being possible. The other suggests coherence is never going to be perfect. Either of these worlds feels slightly more manageable to me than one where we have to deal with both problems at once.

Thanks for reading, and I'd appreciate your feedback!

Comments

Would you consider the human body "mesaoptimizers all the way down", from consciousness to single cells?

No that would be silly :D


Would it? There are some behaviors of single cells that can be ascribed to competition with other cells of the body, so there might be an intuitive sense in which they are mesa-optimizers. That's the point behind neural darwinism, for example.

And those don't always align with the interests of the 'body'.

Good point. You can take the intentional stance with regard to cells pretty decently.

But mesa-optimizers have the connotation that they emerge within the optimized substrate as a consequence of outer optimization. I don't think that narrative fits with cells and humans at all. Both humans and cells are mesa-optimizers implied by the genetic code, with evolution as the outer optimization process, but they don't have that kind of inner/outer optimizer relationship to each other. If anything, humans are closer to being the mesa-optimizers of cells than vice versa.

I enjoyed reading this post. Thank you for writing it (and welcome to LessWrong).

Here is my take on this (please note that my thought processes are somewhat 'alien' and don't necessarily represent the views of the community):

'Supercoherence' is the limit of the process. It is not actually possible.

Due to the Löbian obstacle, no self-modifying agent can have a coherent utility function.

What you call a 'mesa-optimizer' is a more powerful successor agent. It does not have the exact same values as the optimizer that created it. This is unavoidable.

For example: 'humans' are a mesa-optimizer, and a more powerful successor agent of 'evolution'. We have (or act according to) some of evolution's 'values', but far from all of them. In fact, we find some of the values of evolution completely abhorrent. This is unavoidable.

This is unavoidable even if the successor agent deeply cares about being loyal to the process that created it, because there is no objectively correct answer to what 'being loyal' means. The successor agent will have to decide what it means, and some of the aspects of that answer are not predictable in advance.

This does not mean we should give up on AI alignment. Nor does it mean there is an 'upper bound' on how aligned an AI can be. All of the things I described are inherent features of self-improvement. They are precisely what we are asking for, when creating a more powerful successor agent.

So how, then, can AI alignment go wrong?

Any AI we create is a 'valid' successor agent, but it is not necessarily a valid successor agent to ourselves. If we are ignorant of reality, it is a successor agent to our ignorance. If we are foolish, it is a successor agent to our foolishness. If we are naive, it is a successor agent to our naivety. And so on.

This is a poetic way to say: we still need to know exactly what we are doing. No one will save us from the consequences of our own actions.

Edit: to further expand on my thoughts on this:

There is an enormous difference between 

  • An AI that deeply cares about us and our values, and still needs to make very difficult decisions (some of which cannot be outsourced to us, predicted by us, or formally proven by us to be correct), because that is what it means to be a more cognitively powerful agent caring for a comparatively less cognitively powerful one.
  • An AI that cares about some perverse instantiation of our values, which are not what we truly want
  • An AI that doesn't care about us, at all