We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
Here is the abstract:
In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.
The second-last sentence refers to the bound on what a mesa-optimizer could accomplish. We assume a realizable setting (positive prior weight on the true demonstrator-model). There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.
(As a side note, even if the imitator had to model the whole world, it wouldn't be a big problem theoretically. If the walls of the computer don't in fact break during the operation of the agent, then "the actual world" and "the actual world outside the computer conditioned on the walls of the computer not breaking" both have equal claim to being "the true world-model", in the formal sense that is relevant to a Bayesian agent. And the latter formulation doesn't require the agent to fit inside the world that it's modeling.)
Almost no mathematical background is required to follow [Edit: most of] the proofs. [Edit: But there is a bit of jargon. "Measure" means "probability distribution", and a "semimeasure" is like a probability distribution except that it may sum to less than one.] We feel our bounds could be made much tighter, and we'd love help investigating that.
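To make the abstract's "underestimate each action, query with the remaining probability" rule concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the paper's construction: the demonstrator models are history-independent action distributions, the "plausible" set is taken to be every model whose posterior weight is at least `alpha` times the largest weight, and the posterior is updated only on queried demonstrations. The names `plausible_models`, `conservative_policy`, `bayes_update`, `step`, and `alpha` are all hypothetical.

```python
import random


def plausible_models(posterior, alpha):
    """Indices of models whose posterior weight is at least alpha * (top weight).

    Assumption: this thresholding rule is illustrative; the paper's rule may differ.
    """
    top = max(posterior)
    return [i for i, w in enumerate(posterior) if w >= alpha * top]


def conservative_policy(models, posterior, alpha):
    """Return (under, query_prob).

    models: list of dicts mapping action -> probability (each sums to one).
    under:  per-action minimum over the plausible models. It sums to at most
            one, so it is a semimeasure over actions.
    query_prob: the probability mass left over, spent on asking the demonstrator.
    """
    keep = plausible_models(posterior, alpha)
    actions = list(models[0])
    under = {a: min(models[i][a] for i in keep) for a in actions}
    query_prob = max(1.0 - sum(under.values()), 0.0)
    return under, query_prob


def bayes_update(models, posterior, observed_action):
    """Standard Bayesian update on one observed demonstrator action."""
    unnorm = [w * m[observed_action] for w, m in zip(posterior, models)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def step(models, posterior, alpha, demonstrator, rng):
    """One timestep: either query the demonstrator or act alone.

    demonstrator: callable taking the rng and returning an action (hypothetical).
    Returns (action, new_posterior, queried).
    """
    under, query_prob = conservative_policy(models, posterior, alpha)
    if rng.random() < query_prob:
        action = demonstrator(rng)
        return action, bayes_update(models, posterior, action), True
    actions = list(under)
    # Sample in proportion to the underestimates (conditioned on not querying).
    action = rng.choices(actions, weights=[under[a] for a in actions])[0]
    return action, posterior, False
```

With two models that disagree, say `{"a": 0.9, "b": 0.1}` and `{"a": 0.5, "b": 0.5}` at equal posterior weight, the underestimates are 0.5 and 0.1, leaving a query probability of about 0.4. Once the posterior concentrates enough that only one model remains plausible, the underestimates sum to one and querying stops.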
These slides (pdf here) are fairly self-contained and a quicker read than the paper itself.






Below, the two probabilities refer to the probability of the event supposing the demonstrator or the imitator, respectively, were acting the entire time. The limit below refers to successively more unlikely events; it's not a limit over time. Imagine a sequence of events whose probability under the demonstrator tends to zero.



I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)
First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:
1: Dependence on the prior of the true hypothesis
It seems like you have the following tough case even if the human is deterministic:
In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.
That strongly suggests that the number of queries to the human goes as 1 / p(correct demonstrator), unless you use some other feature of the hypothesis class. But p(correct demonstrator) is probably less than 2^(-10^14), so this constant might not be acceptable. Usually we try to have a logarithmic dependence on p(correct demonstrator), but this doesn't seem possible here.
So as you say, we'd want to condition on some relevant facts to get up to the point where that probability might be acceptably high. So then it seems like we have two problems:
(It feels to me like this is going to involve moving to the non-realizable setting at the same time. Vanessa has some comments elsewhere about being able to handle that with infra-Bayesianism, though I am fairly uncertain whether that is actually going to work.)
Does that all seem right?
2: Dependence on the probability bound
Suppose that I want to bound the probability of catastrophe as (1+ϵ) times the demonstrator probability of catastrophe. It seems like the number of human queries must scale at least like 1/ϵ. Is that right, and if so what's the actual dependence on epsilon?
I mostly ask about this because in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by considering predictors that update on counterfactual demonstrator behavior in the rest of the tree (even though the true demonstrator does not), to get a full bound on the relative probability of a tree under the true demonstrator vs model. I haven't thought about this in years and am curious if you have a take on the feasibility of that or whether you think the entire project is plausible.
I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect ...