We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
Here is the abstract:
In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.
The second-last sentence refers to the bound on what a mesa-optimizer could accomplish. We assume a realizable setting (positive prior weight on the true demonstrator-model). There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.
(As a side note, even if the imitator had to model the whole world, it wouldn't be a big problem theoretically. If the walls of the computer don't in fact break during the operation of the agent, then "the actual world" and "the actual world outside the computer conditioned on the walls of the computer not breaking" both have equal claim to being "the true world-model", in the formal sense that is relevant to a Bayesian agent. And the latter formulation doesn't require the agent to fit inside world that it's modeling).
Almost no mathematical background is required to follow [Edit: most of ] the proofs. [Edit: But there is a bit of jargon. "Measure" means "probability distribution", and "semimeasure" is a probability distribution that sums to less than one.] We feel our bounds could be made much tighter, and we'd love help investigating that.
These slides (pdf here) are fairly self-contained and a quicker read than the paper itself.






Below, and refer to the probability of the event supposing the demonstrator or imitator were acting the entire time. The limit below refers to successively more unlikely events ; it's not a limit over time. Imagine a sequence of events such that .



Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.
Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:
The scheme doesn't protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is that we train a big ensemble of neural nets and treat a random sample from that ensemble as a random sample from the posterior, although I don't know if that's how it's actually done.
A lot of the work is done by the assumption that the true demonstrator is in the posterior, which means that at least one of the top-performing models will not have the same correlated errors. But I'm not sure how true this assumption will be in the neural-net approximation I describe above. I worry about inner alignment failures because I don't really trust the neural net prior, and I can imagine training a bunch of neural nets to have correlated weirdnesses about them (in part because of the neural net prior they share, and in part because of things like Adversarial Examples Are Not Bugs, They Are Features). As such it wouldn't be that surprising to me if it turned out that ensembles have certain correlated errors, and in particular don't really represent anything like the demonstrator.
I do feel safer using this method than I would deferring to a single model, so this is still a good idea on balance. I just am not convinced that it solves the inner alignment problem. Instead, I'd say it ameliorates its severity, which may or may not be sufficient.
Yes, I agree that an ensemble of models generated by a neural network may have correlated errors. I only claim to solve the inner alignment problem in theory (i.e. for idealized Bayesian reasoning).