We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
Here is the abstract:
In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.
The second-last sentence refers to the bound on what a mesa-optimizer could accomplish. We assume a realizable setting (positive prior weight on the true demonstrator-model). There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.
(As a side note, even if the imitator had to model the whole world, it wouldn't be a big problem theoretically. If the walls of the computer don't in fact break during the operation of the agent, then "the actual world" and "the actual world outside the computer conditioned on the walls of the computer not breaking" both have equal claim to being "the true world-model", in the formal sense that is relevant to a Bayesian agent. And the latter formulation doesn't require the agent to fit inside the world that it's modeling.)
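To illustrate the realizability assumption mentioned above (positive prior weight on the true demonstrator-model), here is a minimal sketch of a Bayesian update over a tiny, finite model class. The two models, the action names, and the function `bayes_update` are all hypothetical and chosen for illustration; the paper's setting is far more general.

```python
from typing import Dict, List


def bayes_update(
    weights: List[float],
    predictions: List[Dict[str, float]],
    action: str,
) -> List[float]:
    """Reweight each model by the probability it assigned to the observed action."""
    unnormalized = [w * p[action] for w, p in zip(weights, predictions)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("every model assigned probability zero to the observed action")
    return [u / total for u in unnormalized]


# Hypothetical model class: each model maps to action probabilities.
# (A real demonstrator-model would condition on the interaction history.)
true_model = {"left": 0.9, "right": 0.1}
rival_model = {"left": 0.2, "right": 0.8}

# Realizability: the true model starts with positive prior weight.
prior = [0.5, 0.5]

# The demonstrator is observed taking "left", which favors the true model.
posterior = bayes_update(prior, [true_model, rival_model], "left")
print(posterior)
```

Because the true model never assigns probability zero to what the demonstrator actually does, its posterior weight stays positive after every update, which is what the realizable setting relies on.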
Almost no mathematical background is required to follow [Edit: most of] the proofs. [Edit: But there is a bit of jargon. "Measure" means "probability distribution", and a "semimeasure" is like a probability distribution, except that its total may be less than one.] We feel our bounds could be made much tighter, and we'd love help investigating that.
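To make "underestimates the probabilities of each available action, and queries for more data with the remaining probability" concrete, here is a minimal sketch. It assumes a finite set of models and treats a model as "plausible" when its posterior weight is at least `alpha` times the largest weight; that selection rule, the name `conservative_policy`, and the toy numbers are assumptions of this sketch, not the paper's exact definition. Taking the minimum over plausible models gives a semimeasure over actions, and the missing mass becomes the query probability.

```python
from typing import Dict, List, Tuple


def conservative_policy(
    weights: List[float],
    predictions: List[Dict[str, float]],
    alpha: float,
) -> Tuple[Dict[str, float], float]:
    """Return (action probabilities, query probability) for one timestep.

    weights: posterior weight of each model.
    predictions: each model's action probabilities at the current history.
    alpha: plausibility threshold in (0, 1] (an assumption of this sketch).
    """
    top = max(weights)
    plausible = [p for w, p in zip(weights, predictions) if w >= alpha * top]
    actions = plausible[0].keys()
    # Underestimate: for each action, use the lowest probability among plausible models.
    action_probs = {a: min(p[a] for p in plausible) for a in actions}
    # The leftover mass is the probability of asking the demonstrator for help.
    query_prob = max(0.0, 1.0 - sum(action_probs.values()))
    return action_probs, query_prob


model_a = {"left": 0.9, "right": 0.1}
model_b = {"left": 0.2, "right": 0.8}

# Both models are plausible and they disagree, so the imitator often asks for help.
probs_disagree, query_disagree = conservative_policy([0.6, 0.4], [model_a, model_b], alpha=0.5)

# Only model_a is plausible, so nothing is underestimated and no query is needed.
probs_agree, query_agree = conservative_policy([0.99, 0.01], [model_a, model_b], alpha=0.5)

print(probs_disagree, query_disagree)
print(probs_agree, query_agree)
```

In this toy version, the query probability shrinks as the posterior concentrates and fewer rival models remain plausible, which is the intuition behind queries diminishing in frequency.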
These slides (pdf here) are fairly self-contained and a quicker read than the paper itself.






Below, $P_{\text{dem}}(B)$ and $P_{\text{im}}(B)$ refer to the probability of the event $B$ supposing the demonstrator or imitator, respectively, were acting the entire time. The limit below refers to successively more unlikely events $B$; it's not a limit over time. Imagine a sequence of events $(B_i)$ such that $\lim_{i \to \infty} P_{\text{dem}}(B_i) = 0$.



If the inner alignment problem did not exist for perfect Bayesians, but did exist for neural networks, then it would appear to be a regime where more intelligence makes the problem go away. If the inner alignment problem were ~solved for perfect Bayesians, but unsolved for neural networks, I think there's still some of the flavor of that regime, but we do have to be pretty careful to make sure we're applying the same sort of solution to the non-Bayesian algorithms. I think in Vanessa's comment above, she's suggesting this looks doable.
Note the method here of avoiding mesa-optimizers: error bounds. Neural networks don't have those. Naturally, one way to make errors that a mesa-optimizer has deceptively selected go away is just to have better learning algorithms that make errors go away. Algorithms like Gated Linear Networks, which come with proper error bounds, may be a safer building block for AGI. But none of this takes away from the fact that it is potentially important to figure out how to avoid mesa-optimization in neural networks. I would add to your claim that this is a much harder setting: I would say it is harder precisely because of the non-existence of error bounds.